We've recently upgraded to 19.05.1. Per the documentation, adding and subtracting nodes, and certain other slurm.conf changes, necessitate a global restart of all the slurmd's as well as slurmctld. When we do this via salt on 2000+ nodes we get this error:

Aug 6 15:16:09 holy-slurm02 slurmctld[23313]: fatal: locks.c:128 lock_slurmctld: pthread_rwlock_rdlock(): Resource temporarily unavailable
Aug 6 15:16:09 holy-slurm02 slurmctld[23313]: fatal: locks.c:128 lock_slurmctld: pthread_rwlock_rdlock(): Resource temporarily unavailable
Aug 6 15:16:09 holy-slurm02 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE

If I then restart slurmctld it works fine, but it seems that slurmctld doesn't deal well with the stampeding herd of restarts that a global simultaneous restart produces. Previous versions have always worked fine for global restarts, so this is a new issue with 19.05.x.

-Paul Edmon-
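As a workaround, the restarts could be throttled rather than fired all at once; a minimal sketch using salt's batch mode (the 'holy*' minion glob and the 10% batch size are illustrative, not our actual state files, and this assumes slurmd is managed as a systemd unit on the minions):

# Restart slurmd on at most 10% of the matched minions at a time,
# instead of all 2000+ nodes simultaneously.
salt --batch-size 10% 'holy*' service.restart slurmd

# Restart the controller separately, once the fleet has settled.
systemctl restart slurmctld

Batching means only a fraction of the nodes hit slurmctld with re-registrations at any one moment, which is the point of the sketch.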
Paul,

Did you perform any other configuration/environment changes when upgrading Slurm? I assume that you're on a "systemd distro"; could you please share the results of:

>systemctl status slurmctld
>cat /proc/SLURMCTLDPID/limits

and the slurmctld logs?

cheers,
Marcin
So here is a list of the changes we made to the conf (if you look at ticket 7532 you can see our full conf):

RoutePlugin=route/topology
TopologyPlugin=topology/tree
SlurmctldParameters=preempt_send_user_signal
PrologFlags=Contain,X11
AccountingStorageTRES=Billing,CPU,Energy,Mem,Node,FS/Disk,FS/Lustre,Pages,VMem,IC/OFED,gres/gpu
AcctGatherInfinibandType=acct_gather_infiniband/ofed
AcctGatherFilesystemType=acct_gather_filesystem/lustre
JobAcctGatherFrequency=task=30,network=30,filesystem=30
LaunchParameters=mem_sort,slurmstepd_memlock_all
DefCpuPerGPU=1
DefMemPerGPU=100
GpuFreqDef=low
SelectType=select/cons_tres
PriorityFlags=NO_FAIR_TREE

We also added permit_job_expansion and reduce_completing_frag to SchedulerParameters.

[root@holy-slurm02 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─50-ulimit.conf
   Active: active (running) since Wed 2019-08-07 09:40:35 EDT; 23min ago
  Process: 155826 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 155828 (slurmctld)
    Tasks: 722
   Memory: 10.4G
   CGroup: /system.slice/slurmctld.service
           └─155828 /usr/sbin/slurmctld

Aug 07 10:04:29 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: select/cons_tres: _eval_nodes_topo: insufficient resources currently available for JobId=17843175
Aug 07 10:04:29 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: select/cons_tres: _eval_nodes_topo: insufficient resources currently available for JobId=17843175
Aug 07 10:04:29 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: select/cons_tres: _eval_nodes_topo: insufficient resources currently available for JobId=17843175
Aug 07 10:04:29 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: sched: Allocate JobId=17917266 NodeList=holy2b17201 #CPUs=4 Partition=hoekstra
Aug 07 10:04:29 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: _slurm_rpc_submit_batch_job: JobId=17917267 InitPrio=1856652 usec=28043
Aug 07 10:04:30 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: sched: Allocate JobId=17917267 NodeList=holyitc14 #CPUs=1 Partition=itc_cluster
Aug 07 10:04:30 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: _job_complete: JobId=17917234 WEXITSTATUS 1
Aug 07 10:04:30 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: _job_complete: JobId=17917234 done
Aug 07 10:04:30 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: prolog_running_decr: Configuration for JobId=17917265 is complete
Aug 07 10:04:30 holy-slurm02.rc.fas.harvard.edu slurmctld[155828]: Extending JobId=17917265 time limit by 1 secs for configuration

[root@holy-slurm02 slurm]# cat /proc/155828/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1030065              1030065              processes
Max open files            8192                 8192                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1030065              1030065              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

As for the slurmctld logs, those are massive, so sending them all may not be productive.
Do you have a smaller timeslice of the logs you want to see?

-Paul Edmon-
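For completeness: the 50-ulimit.conf drop-in shown in the systemctl status above is how per-process limits get raised for slurmctld. A sketch of what such a drop-in typically looks like (the directive values are illustrative, not the contents of our actual file):

# Illustrative only -- a systemd drop-in is the standard way to raise limits
# (e.g. open files, locked memory) for a unit without editing the packaged
# slurmctld.service file.
mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/50-ulimit.conf <<'EOF'
[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
EOF
systemctl daemon-reload
systemctl restart slurmctld

After the daemon-reload and restart, the new values show up in /proc/<slurmctld pid>/limits, as in the output above.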
Paul,

Did you try to perform a "global restart" of slurmd after the patch from bug 7532? I'm not 100% sure, but it may be related.

cheers,
Marcin
Yes, everything is stable since we got that patch. So I would mark this one as resolved.

-Paul Edmon-
Thanks for the quick reply. I'm closing this as a duplicate.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 7532 ***