Ticket 5719

Summary: Having to restart slurmctld to detect changes made using sacctmgr
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: Database    Assignee: Jason Booth <jbooth>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue    
Priority: --- CC: wfeinstein
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: KAUST

Description Greg Wickham 2018-09-12 08:34:19 MDT
What is the correct procedure to make changes made with 'sacctmgr' visible to 'slurmctld'?

I've had problems modifying a reservation using a newly created account:

  Error updating the reservation: Invalid account or account/partition combination specified
  slurm_update error: Invalid account or account/partition combination specified

and after updating a QOS's "maxsubmitjobs", 'srun' still fails:

  srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Both the above were fixed by restarting 'slurmctld'.

What is the 'proper' way to ensure that what is changed using sacctmgr is available in slurmctld?
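A quick way to check whether slurmctld has actually picked up a sacctmgr change is to compare the database's view with the controller's in-memory cache (a diagnostic sketch; the QOS name `normal` below is a placeholder to replace with the QOS you modified):

```shell
# What the database holds (written directly by sacctmgr):
sacctmgr show qos format=Name,MaxSubmitJobs

# What slurmctld has cached and is actually enforcing:
scontrol show assoc_mgr qos=normal
```

If the two disagree, the update from slurmdbd never reached the controller.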

Thanks,

   -Greg
Comment 2 Jason Booth 2018-09-12 08:53:32 MDT
Hi Greg,

 We have seen this before from other sites where there is a firewall preventing the slurmdbd from contacting the slurmctld. When slurmctld restarts, it initiates the connection in the other direction and gets the updated view, but if there's a firewall in the other direction, changes made through sacctmgr won't be propagated immediately.
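When the two daemons run on different hosts, that reverse path can be checked directly (a sketch; `ctld-host` is a placeholder, and 6817 is the default SlurmctldPort, so adjust both to match your slurm.conf):

```shell
# Run from the host where slurmdbd lives: can it open a TCP
# connection to slurmctld's listening port?
nc -zv ctld-host 6817
```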


Kind regards,
Jason
Comment 3 Greg Wickham 2018-09-12 23:27:58 MDT
Hi Jason,

Given that slurmdbd and slurmctld run on the same node and there is no firewall, what else could be the issue?

# ps ax | grep slurm
 8473 ?        Sl    26:49 /etc/slurm-active/sbin/slurmctld
20991 ?        Sl   671:07 /etc/slurm-active/sbin/slurmdbd

# /sbin/iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination        
#

Changes to the slurm management database aren't picked up quickly (affecting user additions, QOS changes, account management, ... )
Comment 4 Jason Booth 2018-09-13 11:13:55 MDT
Hi Greg,

You will want to raise the log level of slurmdbd to debug, and of slurmctld to debug2, to see these updates. I would suggest looking at the logs to see if these updates are being sent and received. Please find an example below.

slurmdbd logs
[2018-09-13T11:08:55.001] debug:  sending updates to linux at 127.0.0.1(8817) ver 8192

slurmctld logs
[2018-09-13T11:08:55.002] debug2: Processing RPC: ACCOUNTING_UPDATE_MSG from uid=1020

It would also be helpful to see your slurm.conf and your slurmdbd.conf (StoragePass removed).
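For reference, the log levels above can usually be raised without a full restart (a sketch; the exact signal handling may differ by version):

```shell
# slurmctld: bump logging to debug2 at runtime
scontrol setdebug debug2

# slurmdbd: set "DebugLevel=debug" in slurmdbd.conf, then have the
# daemon re-read its configuration (SIGHUP; otherwise restart it)
kill -HUP "$(pidof slurmdbd)"
```

Remember to lower the levels again afterwards, since debug2 is verbose.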


-Jason
Comment 5 Wei Feinstein 2018-09-14 14:48:55 MDT
Jason,

At LBNL we are seeing the same issue, and we do not have a firewall set up. We are consistently having to either delete and re-add the user, or just restart slurmctld. It would be good to know the root cause of this issue and how it is resolved.
Comment 6 Jason Booth 2018-09-21 14:46:38 MDT
Hi Greg,

 Were you able to raise the log level as mentioned in the last message and gather some additional debug logging for me to review? If so, please attach this information along with your configuration files (slurm.conf and slurmdbd.conf).


-Jason
Comment 7 Greg Wickham 2018-09-21 14:46:54 MDT
I will be out of the office until Wednesday, 26th September 2018.

For any issues with Ibex please either:

   - send a request to the Ibex slack channel #general
      (sign up at https://kaust-ibex.slack.com/signup)

   - open a ticket by sending an email to ibex@hpc.kaust.edu.sa

Some useful information:

  To access Ibex, the frontend nodes are:

    ilogin.ibex.kaust.edu.sa (for Intel)
    alogin.ibex.kaust.edu.sa (for AMD)
    glogin.ibex.kaust.edu.sa (for Intel with GPUs)

  For information regarding the unified clusters (tutorial, explanations etc) please refer to the wiki at:

    http://hpc.kaust.edu.sa/ibex

 -Greg

--

Comment 8 Jason Booth 2018-10-04 10:59:46 MDT
Hi Greg -  Any update on this issue?
Comment 9 Jason Booth 2018-10-10 10:41:37 MDT
Hi Greg,

 I am going to close this issue out as timedout for now, since I do not have any further information to look at. Please feel free to re-open if you have some additional logging for me to analyze.

Best regards,
-Jason