Since upgrading to 18.08.7 I have two systems with issues where associations are added to the database but slurmctld does not see them. The issue is not persistent: restarting slurmctld resolves it, and it then does not recur for a while. I currently have a system in this state, with newly created accounts not being picked up in the slurmctld associations. Is there anyone available to help troubleshoot this issue? This problem has been happening for a while, and I added the debug statements in assoc_mgr.c as recommended, but I have not been able to find a root cause. The problem has now appeared on a second, recently upgraded cluster, and we would like to have it resolved as soon as possible. Please advise.
Hi Jackie,

I'd like to have you send a couple pieces of information for me. We have seen similar behavior previously where something was happening to clear out information in the database that is needed to talk to slurmctld. Can you get into the database and run this query for me?

select control_host, control_port, name, rpc_version from cluster_table;

Similarly, I'd like to have you run this command and send the output:

sacctmgr show cluster

You mention that the problem goes away when you restart slurmctld. I would be interested to see if the output of the above commands is any different right after a restart. You also say that the problem doesn't happen for a while after a restart. Are there new users added during this time, or is it just that the users who were added before start working and things are ok until you try to add someone new?

If you can also send the slurmctld and slurmdbd logs from the time you tried to add a new user, along with the username I can search for, that would be helpful.

Thanks,
Ben
See below:

On Tue, Aug 13, 2019 at 1:55 PM <bugs@schedmd.com> wrote:
> Can you get into the database and run this query for me?
>
> select control_host, control_port, name, rpc_version from cluster_table;

Two different servers are behaving the same way; I ran the commands on each for you to see:

master.scs -

select control_host, control_port, name, rpc_version from cluster_table;
+--------------+--------------+------------+-------------+
| control_host | control_port | name       | rpc_version |
+--------------+--------------+------------+-------------+
|              |            0 | perceus-00 |        8448 |
+--------------+--------------+------------+-------------+
1 row in set (0.00 sec)

master.brc -

select control_host, control_port, name, rpc_version from cluster_table;
+--------------+--------------+-----------------+-------------+
| control_host | control_port | name            | rpc_version |
+--------------+--------------+-----------------+-------------+
|              |            0 | 0-a0-d1-ec-bc-c |           0 |
| 10.0.0.10    |         6817 | brc             |        8448 |
|              |            0 | master          |          11 |
|              |            0 | master.brc      |          11 |
+--------------+--------------+-----------------+-------------+
4 rows in set (0.00 sec)

> Similarly, I'd like to have you run this command and send the output:
>
> sacctmgr show cluster

master.scs -

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
perceus-00                            0  8448         1                                                                             mako_debug,normal

master.brc -

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       brc       10.0.0.10         6817  8448         1

> You mention that the problem goes away when you restart slurmctld. I would be
> interested to see if the output of the above commands is any different right
> after a restart.

I have run these commands before and after a restart and have never seen the 'sacctmgr show cluster' output change. The database content I am not sure about.

> You also say that the problem doesn't happen for a while after a restart.

No, it does not, and no logs report any issues. I have increased debug levels before and nothing is reported in slurmctld.log nor slurmdbd.log from the assoc_mgr.c debug lines.

> Are there new users added during this time, or is it just that the users who
> were added before start working and things are ok until you try to add
> someone new?

Some were new, and one was adding an additional association to an existing user.

> If you can also send the slurmctld and slurmdbd logs from the time you tried
> to add a new user, along with the username I can search for, that would be
> helpful.

The logs are a bit tricky because I don't have the specific time these were created. I can give you all of the logs for brc since August 5th, and for scs the past week. Will that work?
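The symptom visible in the query output above can be spotted mechanically. This is a minimal sketch, not part of Slurm itself: the row tuples are transcribed from the outputs above, and the reading that an empty control_host plus a zero control_port means slurmdbd has lost the registration it needs to reach that slurmctld is the working theory in this ticket.

```python
# Rows as returned by:
#   select control_host, control_port, name, rpc_version from cluster_table;
# (values transcribed from the outputs above)
scs_rows = [("", 0, "perceus-00", 8448)]
brc_rows = [
    ("", 0, "0-a0-d1-ec-bc-c", 0),
    ("10.0.0.10", 6817, "brc", 8448),
    ("", 0, "master", 11),
    ("", 0, "master.brc", 11),
]

def missing_control_info(rows):
    """Return cluster names whose registration looks broken: with no
    control_host/control_port recorded, slurmdbd cannot push association
    updates back to that slurmctld."""
    return [name for host, port, name, _rpc in rows if not host or port == 0]

print(missing_control_info(scs_rows))  # perceus-00 has lost its registration
print(missing_control_info(brc_rows))  # stale rows besides 'brc'
```

On master.brc only the 'brc' row carries usable contact information; the other rows look like stale registrations from earlier cluster names.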
Created attachment 11217 [details] Tar file of all slurm logs for August
Created attachment 11218 [details] SCS slurm log files for August tar format
Thank you for that data. It looks like there might be two different things happening on the different systems.

The 'brc' system looks like it has the correct information in the database to be able to communicate. However, when I look at the logs for that system I see the following entries on the 24th of July:

[2019-07-24T15:22:14.782] error: CONN:7 Your user doesn't have privilege to perform this action
[2019-07-24T15:22:14.782] error: CONN:7 Security violation, DBD_ADD_ASSOCS

I tried to see what the slurmctld logs said for that time period, but they didn't go back far enough. If this is the problem on this system, then it looks like something I've seen previously where slurm.conf and slurmdbd.conf have different SlurmUser values specified. Does July 24th sound like a reasonable time for the last attempt to add a user? Can you grep for 'SlurmUser' in both your slurm.conf and slurmdbd.conf to see if they differ?

For the 'scs' system, it looks like it could be the issue I was mentioning earlier, where the database is missing the information it needs to be able to talk to the slurm controller. You said that you never see a control host or port in the 'scontrol show cluster' output on 'scs' after a reboot. How soon after a reboot have you checked that output?

I tried looking through the logs for this system for what I thought might be relevant errors, but didn't see them. Would it be possible for you to add a test user to the system as it is right now, to narrow down a time and username to look at in the logs? Also, if you're able to restart slurmctld, I would like to see how things look when you run 'scontrol show controller' and the database query right after the restart.

Thanks,
Ben
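The SlurmUser comparison suggested above is usually done by eye with grep, but for reference the check can be sketched as follows. The config excerpts here are hypothetical placeholders, not the site's real files; in practice you would read /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf from disk.

```python
import re

# Hypothetical excerpts -- substitute the real file contents.
slurm_conf = "ClusterName=brc\nSlurmUser=slurm\n"
slurmdbd_conf = "DbdHost=master\nSlurmUser=slurm\n"

def get_slurm_user(conf_text):
    """Extract the SlurmUser value from slurm-style key=value config text.
    Slurm treats configuration keywords case-insensitively."""
    m = re.search(r"^\s*SlurmUser\s*=\s*(\S+)", conf_text,
                  re.MULTILINE | re.IGNORECASE)
    return m.group(1) if m else None

ctld_user = get_slurm_user(slurm_conf)
dbd_user = get_slurm_user(slurmdbd_conf)
# A mismatch here would be consistent with the DBD_ADD_ASSOCS
# security violations quoted above.
print("match" if ctld_user == dbd_user else
      "MISMATCH: %s vs %s" % (ctld_user, dbd_user))
```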
Created attachment 11220 [details]
slurmscs-out.tar

For brc, during that July time the system was running 17.11, not 18.08, and it was not experiencing the problem then.

scs -
/etc/slurm/slurm.conf:SlurmUser=slurm
/etc/slurm/slurmdbd.conf:SlurmUser=slurm

brc -
/etc/slurm/slurm.conf:SlurmUser=slurm
/etc/slurm/slurmdbd.conf:SlurmUser=slurm

I just created myself an account in account=test; see the attached log file for what happened. After that I restarted slurmctld and ran:

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
perceus-00       10.0.0.11         6817  8448         1                                                                             mako_debug,normal

which is the same as before the restart.

I had not seen the user privilege errors before; this is the first time I have noticed that. Also, I do not see 'scontrol show controller' as an option in this version of slurm.
Created attachment 11232 [details]
Patch to maintain existing database connection

Hi Jackie,

Thanks for sending that information and the logs. From the slurmdbd logs I can see entries showing new connection requests coming in, which cause the existing connections to be closed. The logs look like this when a new connection is opened:

/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:44678 IP:10.0.2.2 CONN:15
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: acct_storage_p_get_connection: request new connection 1
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Attempting to connect to localhost:3306

And a little while later the connection is closed with logs like this:

/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: DBD_FINI: CLOSE:1 COMMIT:0
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: persistent connection is closed
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: Closed connection 15 uid(44678)

We recently ran into a similar scenario and a patch was created to address it. The patch changes the behavior to close the new connection and leave the established one alone, so that communication can continue between slurmctld and slurmdbd. I'm attaching this patch for you to try out. After it has been applied you would need to restart slurmdbd.

The patch should change the behavior, but I do think it's strange that there are so many new connection requests coming in. In the just under 2 minutes you showed in the logs, there are 28 new connection requests. I assume you're not restarting slurmctld that often. Do you know what might be causing this many new connections to the database? The connections I see are coming from 3 IPs repeatedly:
10.0.2.2
10.0.0.10
10.0.0.11

I'm sorry, I must have been thinking of something else while I was typing 'scontrol show controller'. What I was interested in was the 'scontrol show cluster' output, which looks like it has valid data for the ControlHost and ControlPort right after the restart. This also makes it look like the bug we've encountered before, because the host and port are there right after a restart, and then after things have been running a while the connection information is dropped.

Let me know if you have questions about any of this.

Thanks,
Ben
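The connection churn described above can be tallied straight from the slurmdbd log. A small sketch, using sample lines in the same format as the excerpts above (the timestamps and counts here are illustrative, not taken from the attachment):

```python
import re
from collections import Counter

# Sample slurmdbd debug2 lines in the format shown above.
log_lines = [
    "[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2",
    "[2019-08-13T15:50:05.110] debug2: Opened connection 16 from 10.0.0.10",
    "[2019-08-13T15:50:12.471] debug2: Opened connection 17 from 10.0.2.2",
    "[2019-08-13T15:51:02.003] debug2: Opened connection 18 from 10.0.0.11",
]

opened = re.compile(r"Opened connection (\d+) from (\S+)")

# Count how many new connections each source IP opened in the window.
per_ip = Counter(m.group(2) for line in log_lines if (m := opened.search(line)))
for ip, n in per_ip.most_common():
    print(ip, n)
```

Run against the real log this would show which of the three IPs is responsible for most of the 28 requests in the two-minute window.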
The process you are possibly seeing is telegraf, which gathers data for grafana's graphs. Do you have any other sites using grafana, or other applications polling the database regularly, that have similar issues? And what else would the fix be?

Thanks
Jackie
Here is the process that we are running to collect the grafana data:

1. We are running sinfo, sacct, and squeue within a telegraf (cron-like) program every 60s to gather data per partition; we currently have 27 partitions on scs and 8 partitions on brc to collect for grafana. The connections are only made through the slurm commands.

Question: Isn't it normal behavior for multiple requests from the commands to come through? Users could have scripts that run these commands, and this should not close the connection between slurmdbd and slurmctld. What changed?

In looking over the logs and the statement regarding the slurmdbd session closing, I am not seeing the same correlation. For example, from slurmdbd.log:

[2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7

There is no corresponding connection close statement for this connection until I restarted slurmctld, after I attempted to add the association and it failed:

[2019-08-13T15:55:39.579] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:9

When did the database lose the connection between slurmctld and slurmdbd? In slurmctld I see the following:

[2019-08-13T15:51:12.014] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:52:24.133] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:53:41.140] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:55:36.942] Terminate signal (SIGINT or SIGTERM) received - restart slurmctld by me

What are these messages I am getting in slurmctld as well?

[2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700, this should never happen.
[2019-08-13T15:54:03.775] error: A non superuser 44529 tried to complete batch JobId=20441374

Thanks
Jackie
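The open/close pairing in question can be checked by matching each REQUEST_PERSIST_INIT line to a later "Closed connection" line with the same CONN id. A rough sketch, using sample lines in the formats quoted in this ticket; note that slurmdbd reuses CONN ids over time, so over a long window this pairing is only approximate:

```python
import re

# Sample lines in the formats quoted in this ticket.
lines = [
    "[2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7",
    "[2019-08-13T15:50:00.032] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:44678 IP:10.0.2.2 CONN:15",
    "[2019-08-13T15:50:02.471] debug2: Closed connection 15 uid(44678)",
]

init_re = re.compile(r"REQUEST_PERSIST_INIT:.*CONN:(\d+)")
close_re = re.compile(r"Closed connection (\d+)")

open_conns = {}
for line in lines:
    if m := init_re.search(line):
        open_conns[m.group(1)] = line   # remember the latest open per CONN id
    elif m := close_re.search(line):
        open_conns.pop(m.group(1), None)

# Whatever remains was opened but never closed in this window.
for conn, line in open_conns.items():
    print("no matching close for CONN:%s -> %s" % (conn, line))
```

In this sample, CONN:15 pairs with its close but CONN:7 never does, which mirrors the observation about the morning connection above.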
See responses below:

> Here is the process that we are running to collect the grafana data:
>
> 1. We are running sinfo, sacct, squeue within a telegraf (cron-like)
> program every 60s to gather data per partition; we currently have
> 27 partitions on scs and 8 partitions on brc to collect for grafana.
> The connections are only made through the slurm commands.
>
> Question: Isn't it normal behavior for multiple requests from the commands
> to come through? Users could have scripts that run these commands and
> this should not close the connection between slurmdbd and slurmctld. What
> changed?

If the telegraf program is just running slurm commands, it should not be causing the behavior I'm observing. It is normal for client commands to come through without interrupting the database connection. I can see that there is at least one other site using grafana, and they haven't reported this behavior. Is it possible there are other things that might be connecting to the database directly?

> In looking over the logs and the statement regarding the slurmdbd session
> closing, I am not seeing the same correlation:
>
> [2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7
>
> There is no corresponding connection close statement for this connection until
> I restarted slurmctld, after I attempted to add the association and it failed:
>
> [2019-08-13T15:55:39.579] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:9

These two log entries look like they're for different connections. The UID and IP are the same between them, but the connection identifier (CONN) is different. During the time I have logs for, I did see other connections being established from 10.0.0.11. I would need more logs for the time in between to be able to comment on this further.

> When did the database lose the connection between slurmctld and slurmdbd?
I would also need more of the log files to be able to see exactly when the initial connection was lost. I don't see that the slurmctld logs you sent cover a time where there was a restart.

> In the slurmctld I see the following:
>
> [2019-08-13T15:51:12.014] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:52:24.133] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:53:41.140] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:55:36.942] Terminate signal (SIGINT or SIGTERM) received - restart slurmctld by me

These error messages are related to the connection information being lost. They come from the slurm_open_stream function, where slurmctld is trying to open the connection to the database. You can see it in this section of the code:
https://github.com/SchedMD/slurm/blob/432ebf2dc67a503c18e770df3f047c14943510b6/src/common/slurm_protocol_socket.c#L497-L501

> What are these messages I am getting in slurmctld as well?
>
> [2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700, this should never happen.
>
> [2019-08-13T15:54:03.775] error: A non superuser 44529 tried to complete batch JobId=20441374

It's possible these error messages are also related to this. The first one occurs when slurmctld is trying to find information about an association, which would be missing if a new user was added with sacctmgr but the association wasn't sent correctly to slurmctld. The second message I'm not as sure about. Can you identify the user it's talking about by running 'id 44529' as root to see which user it is?

For your reference, you can read more details about the investigation of the case that is similar to yours in Bug 5213. The most relevant details start showing up around here:
https://bugs.schedmd.com/show_bug.cgi?id=5213#c90

It sounds like you're hesitant to try the patch I sent from that ticket. Can I ask if there's a reason for your hesitancy?

Thanks,
Ben
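For context, the condition behind "Error connecting, bad data" can be rendered in a few lines. This is only a rough Python paraphrase of the sanity check at the linked lines; the real implementation is C in src/common/slurm_protocol_socket.c, and the exact surrounding logic is not reproduced here.

```python
def check_addr(family, port):
    """Rough paraphrase of the sanity check slurm_open_stream performs
    before connecting: an all-zero address means the controller's
    registration was lost, so there is nothing to connect to."""
    if family == 0 or port == 0:
        return "error: Error connecting, bad data: family = %d, port = %d" % (
            family, port)
    return None  # address looks usable; proceed to connect

# A cleared cluster_table row leaves family=0 and port=0:
print(check_addr(0, 0))
print(check_addr(2, 6817))  # e.g. AF_INET (2) with a real port passes
```

This is why the errors line up with the empty control_host/control_port rows seen in the database query earlier in the ticket.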
On Wed, Aug 14, 2019 at 1:17 PM <bugs@schedmd.com> wrote:
> If the telegraf program is just running slurm commands it should not be causing
> the behavior I'm observing/describing. It is normal that client commands come
> through without interrupting the database connection. I can see that there is
> at least one other site using grafana and they haven't reported this behavior.
> Is it possible there are other things that might be connecting to the database
> directly?

Nothing is connecting to the database directly. Only an admin would do it if it's necessary to check things, but that was not done by me at all.

> These two log entries look like they're for different connections. The UID and
> IP are the same between them, but the connection identifier (CONN) is
> different. During the time I have logs for I did see other connections being
> established from 10.0.0.11. I would need more logs for the time in between to
> be able to comment on this further.

The first connection was started earlier that morning; I included it just to show you a connection to slurmdbd from the morning with no disconnect of that session in any of the logs, so I don't see where the connection dropped. The second connection was started after I restarted slurmctld, when I ran the test you asked me to do so we could collect the logs.

> I would also need more of the log files to be able to see when the initial
> connection was lost exactly. I don't see that the slurmctld logs you sent
> cover a time where there was a restart.

All of the logs were sent to you; everything is centrally logged to one host.

> These error messages are related to the connection information being lost. It
> is in the slurm_open_stream function where it's trying to open the connection
> to the database. You can see it in this section of the code:
> https://github.com/SchedMD/slurm/blob/432ebf2dc67a503c18e770df3f047c14943510b6/src/common/slurm_protocol_socket.c#L497-L501

But if you look at the time, it was after I created the account and before I restarted slurmctld. So why did the connection see bad data?

> It's possible these error messages are also related to this. The first one is
> when it's trying to find information about an association, which would be
> missing if a new user was added with sacctmgr but the association wasn't sent
> correctly to slurmctld. The second message I'm not as sure about. Can you
> identify the user it's talking about by running 'id 44529' as root to see which
> user it is?

No, the user it's reporting on is a valid user and has been in the scheduler for a while. I'm not sure what completed batch tasks it is complaining about.
[root@master ~]# sacct -j 20441374 -X -o 'jobid,user,partition' JobID User Partition ------------ --------- ---------- 20441374 gguidi cm1 [root@master ~]# id gguidi uid=44529(gguidi) gid=504(crd) groups=504(crd),606(lawrencium),627(mako),632(californium),40031(pc_exagraph) [root@master ~]# scontrol show assoc | grep gguidi UserName=*gguidi*(44529) DefAccount=pc_exagraph DefWckey= AdminLevel=None ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=cm1 ID=19964 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=es1 ID=19622 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=cf1 ID=18674 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=mako ID=18673 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr_bigmem ID=18672 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr6 ID=18671 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr5 ID=18670 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr4 ID=18669 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr3 ID=18668 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr2 ID=18667 [root@master ~]# sacctmgr show transaction | grep 18668 2019-04-08T10:49:24 Add Associations root id_assoc=*18668* mod_time=1554745764+ 2019-04-08T10:49:32 Modify Associations root (id_assoc=*18668*) qos=',28,29,', delt+ [root@master ~]# sacctmgr show transaction | grep 18669 2019-04-08T10:49:25 Add Associations root id_assoc=*18669* mod_time=1554745765+ 2019-04-08T10:49:32 Modify Associations root (id_assoc=*18669*) qos=',28,29,', delt+ > > For your reference you can read more details about the investigation of the > case that is similar to yours in Bug 5213 <https://bugs.schedmd.com/show_bug.cgi?id=5213>. 
> The most relevant details start showing up around here: https://bugs.schedmd.com/show_bug.cgi?id=5213#c90
>
> It sounds like you're hesitant to try the patch I sent from that ticket. Can I ask if there's a reason for your hesitancy?

No hesitancy; I just want to make sure we are fixing the correct problem. I am going on vacation tomorrow until Wednesday. I want to deploy it, but I don't want to be called during my vacation time for the same issue.

> Thanks,
> Ben
>
> ------------------------------
> You are receiving this mail because:
> - You reported the bug.
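As a side note on checking this kind of mismatch: the `scontrol show assoc` and `sacctmgr` outputs above can be diffed mechanically to list the association IDs that slurmctld never picked up. A minimal sketch, with sample IDs inlined where a live system would pipe in the real command output (the `sacctmgr -nP list assoc format=id` form in the comment is an assumption about commonly available flags, not something taken from this ticket):

```shell
# IDs the controller knows about (on a live system, roughly:
#   scontrol show assoc | grep -o 'ID=[0-9]*' | cut -d= -f2 | sort)
printf '%s\n' 18667 18668 18669 18670 | sort > ctld_ids.txt

# IDs recorded in the accounting database (roughly:
#   sacctmgr -nP list assoc format=id | sort)
printf '%s\n' 18667 18668 18669 18670 19385 | sort > dbd_ids.txt

# comm -13 prints lines unique to the second file: associations that
# exist in the database but are missing from slurmctld.
comm -13 ctld_ids.txt dbd_ids.txt
# prints: 19385
```

An empty result from the `comm` step would mean the controller and database agree; any ID it prints is an association slurmctld has not loaded.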
Hi Jackie,

It sounds like you may have intended to send more information than made it to me. I got a tar file this morning called slurmscs-out.tar. Inside that tar file there was a single file called slurm-testscs.info:

$ tar tvf slurmscs-out.tar
-rw-r--r-- scoggins/staff 131485 2019-08-13 18:01 slurm-testscs.info

Looking inside that file, I can see that you grepped out all the log entries with a timestamp of "15:5" for today from the slurmctld.log and slurmdbd.log files, then paged through the result with 'more':

[root@master tmp]# grep 2019-08-13T15:5 /var/log/slurm/slurmctld.log /var/log/slurm/slurmdbd.log >> /tmp/slurmlogs.withissue
[root@master tmp]# more /tmp/slurmlogs.withissue

But the log entries that show up only cover the time from 15:50:00 through 15:51:51:

/var/log/slurm/slurmctld.log:[2019-08-13T15:50:00.929] _slurm_rpc_submit_batch_job: JobId=20441555 InitPrio=25064 usec=3065
...
/var/log/slurm/slurmctld.log:[2019-08-13T15:51:33.721] error: A non superuser 44529 tried to complete batch JobId=20441374
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2
...
/var/log/slurm/slurmdbd.log:[2019-08-13T15:51:51.372] debug2: Closed connection 10 uid(295)

It looks like your restart of slurmctld was around 15:55:xx, and I don't have logs that cover that time period.

The issue about a non superuser trying to complete a job sounds like a different issue and shouldn't be related to new users not making it from slurmdbd to slurmctld. Are you able to get the output of 'scontrol show job 20441374' still, or is that job no longer on the system? It would probably be best to investigate that issue in a separate ticket so it doesn't get lost in the communication issue.

Thanks,
Ben
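A note on the log extraction above: a single grep on the prefix `2019-08-13T15:5` only matches lines whose timestamp literally begins with that string, so it cannot cover an arbitrary window like 15:50 through 16:00. One way to capture a proper range is a string comparison on the ISO-8601 timestamp, sketched here with synthetic lines standing in for the real log files:

```shell
# Synthetic lines standing in for /var/log/slurm/slurmctld.log.
cat > sample.log <<'EOF'
[2019-08-13T15:49:59.100] debug2: before the window
[2019-08-13T15:51:33.721] error: A non superuser 44529 tried to complete batch JobId=20441374
[2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700
[2019-08-13T16:01:00.000] debug2: after the window
EOF

# Keep lines whose bracketed ISO-8601 timestamp falls in [start, end);
# plain string comparison works because the format sorts chronologically.
# substr($1, 2) drops the leading '[' before comparing.
awk -v start='2019-08-13T15:50' -v end='2019-08-13T16:00' \
    'substr($1, 2) >= start && substr($1, 2) < end' sample.log
```

Run against the real slurmctld.log and slurmdbd.log with a window such as 15:50 through 16:00, this would have included the 15:55:xx restart activity that the original grep missed.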
I have applied the patch and will let you know if I see any further problems. Thanks Jackie
Hi Jackie, I'm sorry there was trouble getting the logs for the right period of time. I'm glad to hear you've applied the patch. I was waiting to hear if anything came up today, but it's been a while now and I haven't heard of a problem. I assume you've been testing adding new users today, is that right? If anything does come up after applying the patch I'm happy to look into it. Thanks, Ben
Thanks Ben. I could get the additional logs, but I thought it best to try out the patch first. I'm on vacation until Wednesday next week. Others are adding accounts and will let me know if they see the problem. I'll grab all of the logs if anything happens.

Thanks

Jackie Scoggins
Although I'm out, I see that the Slurm accounts are still not going to slurmctld. I just saw an account created, and none of the group members were added to the controller.

What steps would you like me to take so you can troubleshoot this issue?

Jackie
Created attachment 11260 [details] slurm-7570.tar.bz2

I am seeing RPCs pending and max_rpc_cnt timeouts during the time of the account creations. The documentation reads: "if using this parameter, set max_rpc_cnt high enough that scheduling isn't always disabled, and low enough that the interactive workload can get through in a reasonable period of time. max_rpc_cnt needs to be below 256 (the default RPC thread limit). Running around the middle (150) may give you good results. NOTE: When increasing the amount of time spent in the backfill scheduling cycle, Slurm can be prevented from responding to client requests in a timely manner. To address this you can use max_rpc_cnt to specify a number of queued RPCs before the scheduler stops to respond to these requests."

Could these timeouts be causing our issue, rather than processes opening and closing the database?

[2019-08-15T15:59:49.873] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-08-15T16:12:43.517] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-15T16:20:24.688] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-15T16:33:19.790] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-08-15T16:38:40.814] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

I am going to attach logs from when I saw the association not get added to the controller.
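For reference, the `SchedulerParameters` lines in the logs above show `max_rpc_cnt=0`, i.e. the limit is disabled. If the site chose to set it, the slurm.conf line would look something like the following; 150 is only the midpoint the quoted documentation mentions, not a recommendation for this particular cluster:

```
# slurm.conf (illustrative): the same SchedulerParameters as in the logs
# above, but with max_rpc_cnt raised from 0 (disabled) to the documented
# midpoint so the scheduler pauses to drain queued RPCs.
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=150,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
```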
Hi Jackie, To clarify, it looks like this patch was applied on your scs system. Is that right? One thing I would like to see is whether the connection information is still getting lost after applying the patch. Can I have you connect to the database for the scs cluster again and run this query: select control_host, control_port, name, rpc_version from cluster_table; I would also like to see the output of the following commands on the machine running slurmdbd: ls -l `which slurmdbd` ps aux | grep slurmdbd And on the machine running slurmctld I would like to see the following: sdiag scontrol show cluster scontrol show assoc | grep dongyingwu id 43537 The log messages you asked about related to the max_rpc_cnt could be something to look into, but should just result in a delay in having the user show up. The max_rpc_cnt controls how many requests slurm will allow to queue up before it will force the scheduler to stop to respond to the requests. Without the max_rpc_cnt defined it should just wait to service them once the scheduling cycle is done. I'll look into this a little more with the sdiag output I've asked for. Thanks, Ben
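To make the next occurrence easier to capture, the commands Ben lists above could be wrapped in one small script that dumps everything into a single timestamped report. A sketch only; the script is written and syntax-checked here rather than run, since sdiag/scontrol require a live Slurm installation, and the file name and output path are arbitrary choices:

```shell
# Write an illustrative capture script; it is not executed here because
# the Slurm client commands need a running cluster.
cat > collect-bug7570.sh <<'EOF'
#!/bin/bash
out=/tmp/bug7570-$(date +%Y%m%dT%H%M%S).txt
# Each string is one diagnostic Ben asked for; the [s] trick keeps the
# ps|grep line from matching the grep process itself.
for cmd in \
    'ls -l "$(which slurmdbd)"' \
    'ps aux | grep "[s]lurmdbd"' \
    'sdiag' \
    'scontrol show cluster' \
    'scontrol show assoc | grep dongyingwu' \
    'id 43537'
do
    printf '===== %s =====\n' "$cmd" >> "$out"
    bash -c "$cmd" >> "$out" 2>&1
done
echo "wrote $out"
EOF

# Syntax check only; running it for real belongs on the slurmctld/slurmdbd hosts.
bash -n collect-bug7570.sh && echo "syntax ok"
```

Running it once on the slurmdbd host and once on the slurmctld host right after a missing association is noticed would give one file per host to attach to the ticket.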
After taking a look at the binaries, it looks like slurmdbd was not updated when I rebuilt it on scs. I just remade it and updated all of the files. If I see this issue again, I will try the commands you mentioned above.

Thanks

Jackie
Ok, I'm glad to hear it was just that the patch hadn't actually been applied. I'll wait to hear how it goes. Thanks Jackie. Ben
Hi Jackie, I wanted to follow up and see if you were able to apply the patch successfully and whether it made a difference. Let me know how things look. Thanks, Ben
I did apply the patch. I was told by one of the admins that she saw an account added earlier this week that did not get into slurmctld, but I have not confirmed that. I will have to get more information from her and let you know.

Thanks

Jackie
Hi Jackie, Were you able to get any additional details about a user still not being passed to slurmctld after being added to the database? I'm happy to look into this further if it's still an issue. Thanks, Ben
So far no one has complained. But I did restart the service on Wednesday, so I can't say it was fixed completely. Let's wait until next week and see whether anyone complains that new accounts are not working.

Thanks

Jackie
Hi Jackie, I just wanted to check in to see how things have been looking with the patch applied. I also wanted to let you know that in Bug 5213 a follow-up patch was created for another issue that was found, related to the patch I provided earlier. The fix is for an issue where a slurmdbd connection was created without validating that there was a port listed. This could cause a seg fault under the right circumstances. This issue has been around for a long time, but only became visible with the patch I provided earlier. I would recommend applying this patch as well to avoid the possibility of a crash. https://github.com/SchedMD/slurm/commit/e57b297ea1c4740ff69cbbe18df6abe3c8b50a74 Please let me know if you have questions. Thanks, Ben
Thanks for the update. I have not had to add any new users on lrc this week, and the person who normally does it is out of the office until Monday next week. And I don't see any users from either system missing from either slurmdbd or slurmctld.

Jackie
Hi Jackie, I'm lowering the priority of this ticket while we monitor the effect of the patch. Have you been able to verify that new users are recognized without a restart of slurmctld this past week? Thanks, Ben
Thanks for the update.

On Thu, Sep 5, 2019 at 1:28 PM <bugs@schedmd.com> wrote:

> Ben Roberts <ben@schedmd.com> changed bug 7570 <https://bugs.schedmd.com/show_bug.cgi?id=7570>:
> Severity: 2 - High Impact -> 3 - Medium Impact
Hi Jackie, I just wanted to follow up and see if the patches supplied in this ticket have addressed the issue you were seeing with new users. Let me know if this ticket is ok to close. Thanks, Ben
It's ok to close it. I haven't seen the issue since the patch.

Thanks

Jackie Scoggins
Thanks Jackie, I'm glad that worked. Closing now. Ben