Ticket 5452 - slurmctld crashes again, core dump created
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.4
Hardware: Linux
Severity: 2 - High Impact
Assignee: Dominik Bartkiewicz
Duplicates: 5438 5447 5675
 
Reported: 2018-07-19 06:58 MDT by Damien
Modified: 2020-01-30 06:57 MST

Site: Monash University
Version Fixed: 17.11.9


Attachments
Zipped slurmctld log (190.63 MB, application/zip)
2018-07-19 07:08 MDT, Damien
core file (133.50 MB, application/octet-stream)
2018-07-19 07:10 MDT, Damien
Prevent job_resrcs from being overwritten for multi-partition job submissions (545 bytes, patch)
2018-07-26 13:15 MDT, Dominik Bartkiewicz

Description Damien 2018-07-19 06:58:43 MDT
Good evening, SLURM support.

Our slurmctld crashed again and produced a core dump. There was no obvious symptom beforehand.

The backup slurmctld did take over.


From the slurmctld logs:
---
[2018-07-19T06:16:16.850] debug3: cons_res: _vns: node m3i038 no mem 11437 < 64000
[2018-07-19T06:16:16.850] debug3: cons_res: _vns: node m3i039 no mem 11437 < 64000
[2018-07-19T06:16:16.850] debug3: cons_res: _vns: node m3i040 no mem 11437 < 64000
[2018-07-19T06:16:16.850] debug3: cons_res: _vns: node m3i041 no mem 47213 < 64000
[2018-07-19T06:16:16.850] debug2: job 2811145 being held, if allowed the job request will exceed QOS normal max tres(cpu) per user limit 280 with already used 272 + requested 16
[2018-07-19T06:16:16.850] debug3: backfill: Failed to start JobId=2811145: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
---

As you can see, there is a large time gap between 2018-07-19T06:16:16.855 and 2018-07-19T06:16:25.321 ... I guess this is where it crashed.


We are currently attending SLURM training; I hope that is just a coincidence.


Kindly investigate. We are hoping to find the root cause.


Many Thanks.


Damien
Comment 1 Dominik Bartkiewicz 2018-07-19 07:05:24 MDT
Hi

Could you use gdb to generate a backtrace? For example:
gdb -batch -ex "thread apply all bt full" <core file>
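If the backtrace comes back with only "??" frames and no symbols, pass the slurmctld binary to gdb as well so it can resolve them, for example:
gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld <core file>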

Dominik
Comment 2 Damien 2018-07-19 07:08:01 MDT
Created attachment 7354
Zipped slurmctld log
Comment 3 Damien 2018-07-19 07:10:00 MDT
Created attachment 7355
core file
Comment 4 Dominik Bartkiewicz 2018-07-19 07:22:07 MDT
Hi

A core file is of little use without the binary and all of its libraries.
Please generate the backtrace on the slurmctld machine itself; you can do it as shown in comment 1.

Dominik
Comment 5 Damien 2018-07-19 07:22:49 MDT
There are two core files today:

[root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.14684
[New LWP 31599]
[New LWP 14686]
[New LWP 14823]
[New LWP 14689]
[New LWP 14830]
[New LWP 14684]
[New LWP 14685]
[New LWP 14687]
[New LWP 14690]
[New LWP 14696]
[New LWP 14744]
[New LWP 14745]
[New LWP 14818]
[New LWP 14832]
[New LWP 14828]
[New LWP 14829]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f984fdae5f7 in ?? ()
"/mnt/slurm-logs/core.14684" is a core file.
Please specify an executable to debug.

Thread 16 (LWP 14829):
#0  0x00007f9850149e91 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 15 (LWP 14828):
#0  0x00007f984fe67413 in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x0000000000424f09 in ?? ()
No symbol table info available.
#3  0x0000000000000001 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 14 (LWP 14832):
#0  0x00007f98501466d5 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 13 (LWP 14818):
#0  0x00007f984fe36efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007f984fe36d94 in ?? ()
No symbol table info available.
#3  0x0000000000000030 in ?? ()
No symbol table info available.
#4  0x000000000b642851 in ?? ()
No symbol table info available.
#5  0x0000000000010000 in ?? ()
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 12 (LWP 14745):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 11 (LWP 14744):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 10 (LWP 14696):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 9 (LWP 14690):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 8 (LWP 14687):
#0  0x00007f9850143ef7 in ?? ()
No symbol table info available.
#1  0x00007f9850143e30 in ?? ()
No symbol table info available.
#2  0x00007f984d11ad28 in ?? ()
No symbol table info available.
#3  0x00007f984d019700 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 7 (LWP 14685):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000100e00000000 in ?? ()
No symbol table info available.
#2  0x00000000006dcfe0 in ?? ()
No symbol table info available.
#3  0x00000000006dd020 in ?? ()
No symbol table info available.
#4  0x0000000000001084 in ?? ()
No symbol table info available.
#5  0x00007f9848000930 in ?? ()
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 6 (LWP 14684):
#0  0x00007f984fe36efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007f984fe67b34 in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 5 (LWP 14830):
#0  0x00007f98501466d5 in ?? ()
No symbol table info available.
#1  0x0000026b00000000 in ?? ()
No symbol table info available.
#2  0x00000000006de260 in ?? ()
No symbol table info available.
#3  0x00000000006de2a0 in ?? ()
No symbol table info available.
#4  0x000000000000ccb0 in ?? ()
No symbol table info available.
#5  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 4 (LWP 14689):
#0  0x00007f9850146a82 in ?? ()
No symbol table info available.
#1  0x0000010500000000 in ?? ()
No symbol table info available.
#2  0x00007f985090cbc0 in ?? ()
No symbol table info available.
#3  0x00007f985090cc00 in ?? ()
No symbol table info available.
#4  0x0000000000000218 in ?? ()
No symbol table info available.
#5  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 3 (LWP 14823):
#0  0x00007f9850143ef7 in ?? ()
No symbol table info available.
#1  0x00007f9850143e30 in ?? ()
No symbol table info available.
#2  0x00007f98471d5d28 in ?? ()
No symbol table info available.
#3  0x00007f9846ed2700 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (LWP 14686):
#0  0x00007f984fe36efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007f984fe36d94 in ?? ()
No symbol table info available.
#3  0x0000000000000003 in ?? ()
No symbol table info available.
#4  0x000000003785cb7a in ?? ()
No symbol table info available.
#5  0x0000000000010000 in ?? ()
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (LWP 31599):
#0  0x00007f984fdae5f7 in ?? ()
No symbol table info available.
#1  0x00007f984fdafce8 in ?? ()
No symbol table info available.
#2  0x0000000000000020 in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.


And 


[root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.32003 
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007efc7b4605f7 in ?? ()
"/mnt/slurm-logs/core.32003" is a core file.
Please specify an executable to debug.

Thread 15 (LWP 32008):
#0  0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1  0x0002648a00000000 in ?? ()
No symbol table info available.
#2  0x00007efc7bfbebc0 in ?? ()
No symbol table info available.
#3  0x00007efc7bfbec00 in ?? ()
No symbol table info available.
#4  0x000000000002c2a3 in ?? ()
No symbol table info available.
#5  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 14 (LWP 32006):
#0  0x00007efc7b7f5ef7 in ?? ()
No symbol table info available.
#1  0x00007efc7b7f5e30 in ?? ()
No symbol table info available.
#2  0x00007efc787ccd28 in ?? ()
No symbol table info available.
#3  0x00007efc786cb700 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 13 (LWP 32005):
#0  0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007efc7b4e8d94 in ?? ()
No symbol table info available.
#3  0x0000000000000001 in ?? ()
No symbol table info available.
#4  0x000000001ac89147 in ?? ()
No symbol table info available.
#5  0x0000000000010000 in ?? ()
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 12 (LWP 32003):
#0  0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007efc7b519b34 in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 11 (LWP 32004):
#0  0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1  0x0005893000000000 in ?? ()
No symbol table info available.
#2  0x00000000006ddd00 in ?? ()
No symbol table info available.
#3  0x00000000006ddd40 in ?? ()
No symbol table info available.
#4  0x00000000001bf787 in ?? ()
No symbol table info available.
#5  0x0000000000101000 in ?? ()
No symbol table info available.
#6  0x0000000000464f20 in ?? ()
No symbol table info available.
#7  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 10 (LWP 32019):
#0  0x00007efc7b7fbe91 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 9 (LWP 32020):
#0  0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1  0x0002b50a00000000 in ?? ()
No symbol table info available.
#2  0x00000000006de260 in ?? ()
No symbol table info available.
#3  0x00000000006de2a0 in ?? ()
No symbol table info available.
#4  0x00000000014738ae in ?? ()
No symbol table info available.
#5  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 8 (LWP 32016):
#0  0x00007efc7b519413 in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x0000000000424f09 in ?? ()
No symbol table info available.
#3  0x0000000000000001 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 7 (LWP 32022):
#0  0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 6 (LWP 32015):
#0  0x00007efc7b7f5ef7 in ?? ()
No symbol table info available.
#1  0x00007efc7b7f5e30 in ?? ()
No symbol table info available.
#2  0x00007efc726bad28 in ?? ()
No symbol table info available.
#3  0x00007efc725b9700 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 5 (LWP 32011):
#0  0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 4 (LWP 32013):
#0  0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 3 (LWP 32012):
#0  0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (LWP 32014):
#0  0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1  0x0000000000000002 in ?? ()
No symbol table info available.
#2  0x00007efc7b4e8d94 in ?? ()
No symbol table info available.
#3  0x0000000000000056 in ?? ()
No symbol table info available.
#4  0x000000001d8da290 in ?? ()
No symbol table info available.
#5  0x0000000000010000 in ?? ()
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (LWP 32010):
#0  0x00007efc7b4605f7 in ?? ()
No symbol table info available.
#1  0x00007efc7b461ce8 in ?? ()
No symbol table info available.
#2  0x0000000000000020 in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.
Comment 6 Damien 2018-07-19 07:24:41 MDT
Here is more info:


scontrol show config
Configuration data as of 2018-07-19T23:23:40
AccountingStorageBackupHost = m3-mgmt1
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = m3-mgmt2
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = m3-mgmt1
BackupController        = m3-mgmt1
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2018-07-19T18:11:26
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = m3
CompleteWait            = 10 sec
ControlAddr             = m3-mgmt2
ControlMachine          = m3-mgmt2
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand
CryptoType              = crypto/munge
DebugFlags              = Gres
DefMemPerNode           = UNLIMITED
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = /opt/slurm/etc/slurm.epilog
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /opt/nhc-1.4.2/sbin/nhc
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 10 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 15000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MemLimitEnforce         = Yes
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = pmi2
MpiParams               = ports=12000-12999
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 2815470
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 1 min
PluginDir               = /opt/slurm-17.11.4/lib/slurm
PlugStackConfig         = /opt/slurm-17.11.4/etc/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
PriorityParameters      = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = Yes
PriorityFlags           = FAIR_TREE
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightFairShare = 80000
PriorityWeightJobSize   = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS       = 60000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = /opt/slurm/etc/slurm.prolog
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(497)
SlurmctldDebug          = debug3
SlurmctldLogFile        = /mnt/slurm-logs/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = quiet
SlurmctldTimeout        = 300 sec
SlurmdDebug             = debug5
SlurmdLogFile           = /var/log/slurmd.log
SlurmdPidFile           = /opt/slurm/var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /opt/slurm/var/spool
SlurmdSyslogDebug       = quiet
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = /mnt/slurm-logs/slurmsched.log
SlurmSchedLogLevel      = 9
SlurmctldPidFile        = /opt/slurm/var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /opt/slurm-17.11.4/etc/slurm.conf
SLURM_VERSION           = 17.11.4
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /opt/slurm/var/state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at m3-mgmt2/m3-mgmt1 are UP/UP



Kindly investigate.


Thanks

Damien
Comment 7 Dominik Bartkiewicz 2018-07-19 07:57:07 MDT
Hi

Let's try adding the binary path; that should give us a proper backtrace:
 
gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.14684

Dominik
Comment 8 Damien 2018-07-19 08:01:23 MDT
There you go:

gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.32003 
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007efc7b4605f7 in raise () from /lib64/libc.so.6

Thread 15 (Thread 0x7efc785ca700 (LWP 32008)):
#0  0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007efc7bd4765f in _agent (x=<optimized out>) at slurmdbd_defs.c:1979
        err = <optimized out>
        cnt = <optimized out>
        rc = <optimized out>
        buffer = <optimized out>
        abs_time = {tv_sec = 1531981566, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {msg_type = 1474, data = 0x7efc785c9ea0}
        list_msg = {my_list = 0x0, return_code = 0}
        __func__ = "_agent"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 14 (Thread 0x7efc786cb700 (LWP 32006)):
#0  0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007efc787cff6e in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
No locals.
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 13 (Thread 0x7efc787cc700 (LWP 32005)):
#0  0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007efc787d080e in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#3  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 12 (Thread 0x7efc7c1d1740 (LWP 32003)):
#0  0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007efc7b519b34 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000000004279f4 in _slurmctld_background (no_data=0x0) at controller.c:1767
        i = 8
        job_limit = <optimized out>
        delta_t = 7
        last_full_sched_time = 1531981512
        last_ctld_bu_ping = 1531981415
        last_uid_update = 1531980772
        last_reboot_msg_time = 1531199546
        ping_interval = 100
        job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = READ_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_group_time = 1531981473
        last_acct_gather_node_time = 1531199545
        last_ext_sensors_time = 1531199545
        last_resv_time = 1531981556
        tv1 = {tv_sec = 1531981560, tv_usec = 209825}
        node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_timelimit_time = 1531981536
        last_assert_primary_time = 1531981358
        purge_job_interval = 60
        tv2 = {tv_sec = 1531981560, tv_usec = 209832}
        config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_purge_job_time = 1531981512
        last_node_acct = 1531981285
        no_resp_msg_interval = <optimized out>
        tv_str = "usec=7\000\000\064\066\000\067\000\000\000\000\000\000\000"
        job_write_lock2 = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_no_resp_msg_time = 1531981560
        now = <optimized out>
        last_sched_time = 1531981554
        last_ping_node_time = 1531981506
        part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        last_health_check_time = 1531981363
        last_checkpoint_time = 1531981452
        last_ping_srun_time = 1531199545
        last_trigger = 1531981546
#3  main (argc=<optimized out>, argv=<optimized out>) at controller.c:604
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        stat_buf = {st_dev = 64769, st_ino = 143988, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392784, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1531188475, tv_nsec = 698018084}, st_mtim = {tv_sec = 1418762451, tv_nsec = 0}, st_ctim = {tv_sec = 1463116446, tv_nsec = 736207571}, __unused = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        callbacks = {acct_full = 0x4a93eb <trigger_primary_ctld_acct_full>, dbd_fail = 0x4a95fa <trigger_primary_dbd_fail>, dbd_resumed = 0x4a9688 <trigger_primary_dbd_res_op>, db_fail = 0x4a970d <trigger_primary_db_fail>, db_resumed = 0x4a979b <trigger_primary_db_res_op>}
        create_clustername_file = 44
        __func__ = "main"

Thread 11 (Thread 0x7efc7c1d0700 (LWP 32004)):
#0  0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000464f20 in _wr_wrlock (datatype=datatype@entry=JOB_LOCK) at locks.c:229
        err = <optimized out>
        __func__ = "_wr_wrlock"
#2  0x000000000046516c in lock_slurmctld (lock_levels=...) at locks.c:133
No locals.
#3  0x000000000041e1ac in _agent_retry (mail_too=false, min_wait=999) at agent.c:1381
        agent_arg_ptr = 0x0
        mi = 0x0
        rc = <optimized out>
        now = 1531981561
        queued_req_ptr = 0x0
        retry_iter = <optimized out>
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
#4  _agent_init (arg=<optimized out>) at agent.c:1326
        min_wait = 999
        mail_too = false
        ts = {tv_sec = 1531981562, tv_nsec = 0}
        __func__ = "_agent_init"
#5  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 10 (Thread 0x7efc721b5700 (LWP 32019)):
#0  0x00007efc7b7fbe91 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000042925c in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891
        sig = 1
        i = <optimized out>
        rc = <optimized out>
        sig_array = {2, 15, 1, 6, 12, 0}
        set = {__val = {18467, 0 <repeats 15 times>}}
        __func__ = "_slurmctld_signal_hand"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 9 (Thread 0x7efc720b4700 (LWP 32020)):
#0  0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000049ee9c in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204
        err = <optimized out>
        last_save = 1531981553
        now = 1531981553
        save_delay = <optimized out>
        run_save = <optimized out>
        save_count = 0
        __func__ = "slurmctld_state_save"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 8 (Thread 0x7efc722b6700 (LWP 32016)):
#0  0x00007efc7b519413 in select () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000424f09 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026
        max_fd = <optimized out>
        newsockfd = <optimized out>
        sockfd = 0x7efc5c000950
        cli_addr = {sin_family = 2, sin_port = 35469, sin_addr = {s_addr = 4039643308}, sin_zero = "\000\000\000\000\000\000\000"}
        srv_addr = {sin_family = 2, sin_port = 41242, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
        port = 41242
        ip = "0.0.0.0", '\000' <repeats 24 times>
        fd_next = 0
        i = <optimized out>
        nports = 1
        rfds = {__fds_bits = {8, 0 <repeats 15 times>}}
        conn_arg = <optimized out>
        config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        sigarray = {10, 0}
        node_addr = <optimized out>
        __func__ = "_slurmctld_rpc_mgr"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x7efc71eb2700 (LWP 32022)):
#0  0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000423536 in _purge_files_thread (no_data=<optimized out>) at controller.c:3160
        err = <optimized out>
        job_id = 0x0
        __func__ = "_purge_files_thread"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x7efc725b9700 (LWP 32015)):
#0  0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007efc729bfe75 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1453
No locals.
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x7efc72eca700 (LWP 32011)):
#0  0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000043543a in _heartbeat_thread (no_data=<optimized out>) at heartbeat.c:130
        err = <optimized out>
        beat = 30
        now = <optimized out>
        nl = 16730398017500217344
        ts = {tv_sec = 1531981574, tv_nsec = 0}
        reg_file = 0x0
        new_file = 0x0
        fd = 7
        __func__ = "_heartbeat_thread"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x7efc72cc8700 (LWP 32013)):
#0  0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000043060d in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161
        err = <optimized out>
        ts = {tv_sec = 1531981562, tv_nsec = 0}
        job_update_info = <optimized out>
        __func__ = "_fed_job_update_thread"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7efc72dc9700 (LWP 32012)):
#0  0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000042c963 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203
        err = <optimized out>
        cluster = <optimized out>
        ts = {tv_sec = 1531981562, tv_nsec = 0}
        cluster_iter = <optimized out>
        rpc_iter = <optimized out>
        rpc_rec = <optimized out>
        req_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        resp_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        ctld_req_msg = {my_list = 0x0}
        success_bits = <optimized out>
        rc = <optimized out>
        resp_inx = <optimized out>
        success_size = <optimized out>
        fed_read_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = READ_LOCK}
        __func__ = "_agent_thread"
#2  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7efc726ba700 (LWP 32014)):
#0  0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007efc729c2596 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
        start_time = 1531981347
        last_reset = 1469517764
        next_reset = 0
        calc_period = 300
        decay_hl = <optimized out>
        reset_period = 0
        now = 1531981347
        run_delta = <optimized out>
        real_decay = <optimized out>
        elapsed = <optimized out>
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK, federation = NO_LOCK}
        locks = {assoc = WRITE_LOCK, file = NO_LOCK, qos = NO_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
        __func__ = "_decay_thread"
#3  0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7efc735ed700 (LWP 32010)):
#0  0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007efc7b461ce8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007efc7b459566 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007efc7b459612 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007efc7bc72a21 in bit_ffs (b=<optimized out>) at bitstring.c:475
        bit = 0
        value = -1
        __PRETTY_FUNCTION__ = "bit_ffs"
#5  0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
        i = 0
        j = 0
        num_jobs = 8
        size = <optimized out>
        x = 0
        this_row = <optimized out>
        orig_row = 0x7efc4c8c60b0
        ss = 0x7efc4c7511a0
        __func__ = "_build_row_bitmaps"
#6  0x00007efc79dfc053 in _rm_job_from_res (part_record_ptr=part_record_ptr@entry=0x7efc4c0142c0, node_usage=node_usage@entry=0x7efc4c209940, job_ptr=job_ptr@entry=0x7efc44446bd0, action=action@entry=0) at select_cons_res.c:1294
        p_ptr = 0x7efc4c892110
        job = 0x7efc4c643a20
        node_ptr = <optimized out>
        first_bit = 0
        last_bit = <optimized out>
        i = <optimized out>
        n = <optimized out>
        gres_list = <optimized out>
        __func__ = "_rm_job_from_res"
#7  0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
        first_job_ptr = 0x7efc44446bd0
        next_job_ptr = <optimized out>
        overlap = <optimized out>
        last_job_ptr = 0x7efc44446bd0
        rm_job_cnt = 0
        tv1 = {tv_sec = 1531981561, tv_usec = 105137}
        tv_str = '\000' <repeats 19 times>
        delta_t = 139621943865360
        time_window = 30
        more_jobs = true
        tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816}
        cr_job_list = 0x7efc44b91fb0
        tmp_cr_type = 20
        future_part = 0x7efc4c0142c0
        tmp_job_ptr = 0x7efc44446bd0
        preemptee_iterator = <optimized out>
        orig_map = 0x7efc4c014300
        qos_preemptor = false
        future_usage = 0x7efc4c209940
        job_iterator = 0x7efc74000990
        action = <optimized out>
        rc = -1
        now = 1531981561
#8  select_p_job_test (job_ptr=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x7efc735ecab8, exc_core_bitmap=0x0) at select_cons_res.c:2310
        rc = 22
        debug_cpu_bind = false
        debug_check = true
#9  0x00007efc7bca9a3c in select_g_job_test (job_ptr=job_ptr@entry=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7efc735ecab8, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:582
No locals.
#10 0x00007efc735f2f39 in _try_sched (job_ptr=job_ptr@entry=0x7efc4c0e5e90, avail_bitmap=avail_bitmap@entry=0x7efc735ecdf8, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:482
        orig_shared = 254
        now = 1531981561
        str = "\300\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\070\000\000\000\000\000\000\000(\000\000\000\000\000\000\000\340\063\326{\374~\000\000\247\000\000\000\000\000\000\000e9\326{\374~\000\000\305\065P[\000\000\000\000\301I\325{\374~\000\000\240\316^s\374~\000\000\240\000\000\000\000\000\000\000\240\316^s"
        low_bitmap = 0x0
        tmp_bitmap = 0x7efc4c014280
        rc = 0
        has_xor = false
        feat_cnt = 0
        detail_ptr = <optimized out>
        preemptee_candidates = 0x0
        preemptee_job_list = 0x0
        feat_iter = <optimized out>
        feat_ptr = <optimized out>
        __func__ = "_try_sched"
#11 0x00007efc735f5677 in _attempt_backfill () at backfill.c:1894
        bf_job_id = <optimized out>
        bf_array_task_id = <optimized out>
        bf_job_priority = <optimized out>
        tv1 = {tv_sec = 1531981560, tv_usec = 876382}
        tv2 = {tv_sec = 0, tv_usec = 139622873694297}
        tv_str = '\000' <repeats 19 times>
        delta_t = 139622873694297
        job_queue = <optimized out>
        job_queue_rec = 0x0
        bb = <optimized out>
        i = <optimized out>
        j = <optimized out>
        k = <optimized out>
        node_space_recs = <optimized out>
        mcs_select = <optimized out>
        qos_ptr = <optimized out>
        job_ptr = 0x7efc4c0e5e90
        part_ptr = <optimized out>
        bf_part_ptr = 0x0
        end_time = 1531983301
        end_reserve = <optimized out>
        deadline_time_limit = <optimized out>
        boot_time = 0
        orig_end_time = <optimized out>
        time_limit = <optimized out>
        comp_time_limit = <optimized out>
        orig_time_limit = <optimized out>
        part_time_limit = <optimized out>
        min_nodes = 1
        max_nodes = 1
        req_nodes = 1
        active_bitmap = 0x0
        avail_bitmap = 0x7efc4c0feaa0
        exc_core_bitmap = 0x0
        resv_bitmap = 0x7efc4c00da50
        now = 1531981561
        sched_start = <optimized out>
        later_start = 0
        start_res = 1531981561
        resv_end = <optimized out>
        window_end = <optimized out>
        orig_sched_start = <optimized out>
        orig_start_time = <optimized out>
        node_space = 0x7efc4c0f26a0
        bf_user_part_ptr = 0x0
        bf_time1 = {tv_sec = 1531981560, tv_usec = 877289}
        bf_time2 = {tv_sec = 1531981530, tv_usec = 876177}
        rc = 0
        error_code = <optimized out>
        job_test_count = <optimized out>
        test_time_count = <optimized out>
        pend_time = <optimized out>
        uid = 0x0
        nuser = <optimized out>
        bf_parts = <optimized out>
        bf_part_jobs = 0x0
        bf_part_resv = 0x0
        njobs = 0x0
        already_counted = true
        reject_array_job_id = <optimized out>
        reject_array_part = <optimized out>
        job_start_cnt = <optimized out>
        start_time = <optimized out>
        config_update = <optimized out>
        part_update = <optimized out>
        start_tv = {tv_sec = 1531981560, tv_usec = 876400}
        test_array_job_id = <optimized out>
        test_array_count = <optimized out>
        job_no_reserve = <optimized out>
        resv_overlap = true
        save_share_res = <optimized out>
        save_whole_node = <optimized out>
        test_fini = -1
        user_part_inx1 = <optimized out>
        user_part_inx2 = <optimized out>
        part_inx = <optimized out>
        user_inx = <optimized out>
        qos_flags = <optimized out>
        qos_blocked_until = <optimized out>
        qos_part_blocked_until = <optimized out>
        qos_read_lock = {assoc = NO_LOCK, file = NO_LOCK, qos = READ_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
        __func__ = "_attempt_backfill"
#12 0x00007efc735f7bc0 in backfill_agent (args=<optimized out>) at backfill.c:904
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1531981530
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        load_config = <optimized out>
        short_sleep = <optimized out>
        backfill_cnt = 23555
        __func__ = "backfill_agent"
#13 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.



Thanks

Damien
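
With symbols, the failure point is clear: thread 1 aborts inside the bit_ffs() assertion at bitstring.c:475, reached from _build_row_bitmaps() while the backfill thread runs _will_run_test(). A minimal sketch of that chain, paraphrased from the backtrace above rather than taken from the Slurm sources:

/*
 * Sketch only -- paraphrased from the backtrace, not the actual 17.11
 * code. bit_ffs() asserts that the bitstring it is given is valid;
 * when _build_row_bitmaps() hands it a bitmap from a corrupt
 * job_resources record, the assert fails, __assert_fail() calls
 * abort(), and slurmctld dies with "signal 6, Aborted".
 */
#include <assert.h>

typedef unsigned long bitstr_t;   /* stand-in for Slurm's bitstr_t */

static int bit_ffs_sketch(bitstr_t *b)
{
	assert(b != NULL);        /* the check that fires in thread 1 */
	/* ... scan b for the first set bit ... */
	return -1;
}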
Comment 9 Dominik Bartkiewicz 2018-07-20 07:04:58 MDT
Hi

Could you run gdb interactively on core.32003 and execute these commands?
t 1
f 7
p tmp_job_ptr
p rm_job_cnt

Dominik
Comment 10 Damien 2018-07-20 07:36:45 MDT
Hi Dominik

I’m not familiar with gdb. Can you give me more details?

Thanks

Damien


Comment 11 Dominik Bartkiewicz 2018-07-20 08:01:21 MDT
Hi

Of course:
gdb /opt/slurm-17.11.4/sbin/slurmctld core.32003
You should then see a prompt like this: "(gdb)"

Go to thread 1:
t 1
Then pick frame 7:
f 7
And you can print some values:
p tmp_job_ptr
p rm_job_cnt
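
As a side note, the same inspection can be scripted non-interactively with the batch mode used in comment 1:
gdb -batch -ex "t 1" -ex "f 7" -ex "p tmp_job_ptr" -ex "p rm_job_cnt" /opt/slurm-17.11.4/sbin/slurmctld core.32003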

Dominik
Comment 12 Dominik Bartkiewicz 2018-07-20 09:36:29 MDT
Hi

Let me know if this is clear.
Could you send me the value of ss[x].tmpjobs?
e.g.:
thread 1
frame 5
print ss[x].tmpjobs

Dominik
Comment 13 Damien 2018-07-22 01:22:30 MDT
Hi Dominik

Here are the values:

 tmp]# gdb /opt/slurm-17.11.4/sbin/slurmctld  core.32003 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm-17.11.4/sbin/slurmctld...done.
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 sssd-client-1.13.0-40.el7_2.12.x86_64
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0  0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) f 7
#7  0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, 
    job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931	select_cons_res.c: No such file or directory.
(gdb) p tmp_job_ptr
$1 = (struct job_record *) 0x7efc44446bd0
(gdb) p rm_job_cnt
$2 = 0
(gdb)
Comment 14 Damien 2018-07-22 01:24:58 MDT
Hi Dominik

Here are the extras:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#7  0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, 
    job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931	in select_cons_res.c
(gdb) frame 5
#5  0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677	in select_cons_res.c
(gdb) print ss[x].tmpjobs
$7 = (struct job_resources *) 0x7efc44660740
(gdb) 


I hope this is sufficient; if not, please let us know.


Many Thanks


Damien
Comment 15 Dominik Bartkiewicz 2018-07-22 03:12:05 MDT
Hi

Thank you, I appreciate your efforts and patience.
I should have asked for this earlier; could you attach the output of the following?

thread 1
frame 5
info locals
print *(ss[x].tmpjobs)

t 1
f 7
info locals
print *tmp_job_ptr

Dominik
Comment 16 Damien 2018-07-22 06:13:55 MDT
Hi Dominik

Here you go:


(gdb) 
(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0  0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) frame 5
#5  0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677	select_cons_res.c: No such file or directory.
(gdb) info locals
i = 0
j = 0
num_jobs = 8
size = <optimized out>
x = 0
this_row = <optimized out>
orig_row = 0x7efc4c8c60b0
ss = 0x7efc4c7511a0
__func__ = "_build_row_bitmaps"
(gdb) print *(ss[x].tmpjobs)
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, 
  cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, 
  sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) 
$2 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, 
  cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, 
  sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) 
$3 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, 
  cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, 
  sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#5  0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677	in select_cons_res.c
(gdb) f 5
#5  0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677	in select_cons_res.c
(gdb) f 7
#7  0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, 
    job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931	in select_cons_res.c
(gdb) info locals
first_job_ptr = 0x7efc44446bd0
next_job_ptr = <optimized out>
overlap = <optimized out>
last_job_ptr = 0x7efc44446bd0
rm_job_cnt = 0
tv1 = {tv_sec = 1531981561, tv_usec = 105137}
tv_str = '\000' <repeats 19 times>
delta_t = 139621943865360
time_window = 30
more_jobs = true
tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816}
cr_job_list = 0x7efc44b91fb0
tmp_cr_type = 20
future_part = 0x7efc4c0142c0
tmp_job_ptr = 0x7efc44446bd0
preemptee_iterator = <optimized out>
orig_map = 0x7efc4c014300
qos_preemptor = false
future_usage = 0x7efc4c209940
job_iterator = 0x7efc74000990
action = <optimized out>
rc = -1
now = 1531981561
(gdb) print *tmp_job_ptr
$4 = {account = 0x7efc4423e7c0 "ax22", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7efc446f8000 "m3-login2", alloc_resp_port = 0, 
  alloc_sid = 667, array_job_id = 2814050, array_task_id = 1, array_recs = 0x0, assoc_id = 1337, assoc_ptr = 0x111d1c0, batch_flag = 1, 
  batch_host = 0x7efc4c22a170 "m3a000", billable_tres = 6, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, 
  ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 6, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, 
  derived_ec = 0, details = 0x7efc442249f0, direct_set_prio = 0, end_time = 1531983301, end_time_exp = 1531983301, epilog_running = false, 
  exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7efc4c40b0d0 "", 
  gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7efc4c2653c0 "", gres_used = 0x0, group_id = 10025, job_id = 2814051, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7efc4c643a20, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1531981561, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7efc4414c940}, mail_type = 0, 
  mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7efc446f7fd0 "seecr19july", network = 0x0, next_step_id = 0, ngids = 0, 
  nodes = 0x7efc4c2653a0 "m3a000", node_addr = 0x7efc4c119940, node_bitmap = 0x7efc4c148520, node_bitmap_cg = 0x0, node_cnt = 1, 
  node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, 
  pack_job_list = 0x0, partition = 0x7efc444ba010 "short", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7efc44863820, 
  power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 73105, priority_array = 0x0, 
  prio_factors = 0x7efc446f7f40, profile = 4294967295, qos_id = 1, qos_ptr = 0x10a3920, qos_blocking_ptr = 0x0, reboot = 0 '\000', 
  restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, 
  select_jobinfo = 0x7efc4489c500, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1531981561, 
  state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x7efc449009c0, suspend_time = 0, time_last_active = 1531981561, 
  time_limit = 29, time_min = 0, tot_sus_time = 0, total_cpus = 6, total_nodes = 1, tres_req_cnt = 0x7efc44c1d9b0, 
  tres_req_str = 0x7efc445ecca0 "1=6,2=8000,4=1", tres_fmt_req_str = 0x7efc4428bd50 "cpu=6,mem=8000M,node=1", tres_alloc_cnt = 0x7efc4c645f00, 
  tres_alloc_str = 0x7efc4c0bd590 "1=6,2=8000,3=18446744073709551614,4=1,5=6", 
  tres_fmt_alloc_str = 0x7efc4c40b040 "cpu=6,mem=8000M,node=1,billing=6", user_id = 11014, user_name = 0x0, wait_all_nodes = 0, warn_flags = 0, 
  warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) 


I hope that you find the problem.

Many Thanks.

Damien
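
The prints above contain the smoking gun: the job_resources record at ss[x].tmpjobs has node_bitmap = 0x0 and core_bitmap = 0x0 (along with nearly every other pointer member), so the bitmap handed to bit_ffs() is NULL and the assertion aborts the daemon. The failing call presumably has roughly this shape, inferred from frames 4 and 5; the exact field passed is an assumption, since the backtrace does not show the argument:

/* Inferred from the backtrace, not quoted from select_cons_res.c: */
ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);  /* node_bitmap == 0x0 -> assert */

A job_resources record in this half-initialized state is consistent with it having been overwritten mid-scheduling, which is exactly what the eventual fix (comment 29) addresses.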
Comment 17 Dominik Bartkiewicz 2018-07-24 02:21:14 MDT
Hi

We are still investigating this issue.
Does this still occur?

Dominik
Comment 18 Damien 2018-07-24 21:20:04 MDT
Hi Dominik 

It has not reappeared since, but it crashed twice last Thursday night and once about a month ago.

We are wondering whether there is a preventative measure we can take, or whether this is a CPU-load issue or a configuration problem.



Cheers

Damien
Comment 29 Dominik Bartkiewicz 2018-07-27 03:36:27 MDT
Hi

This patch should fix the issue.
It hasn't been committed yet, but we expect it to be merged soon in this or a similar form.

Dominik
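
The attachment is only 545 bytes, and going by its title alone the change is presumably a guard of roughly the following shape. This is an illustrative sketch, not the committed patch; the placement, variable names, and the use of free_job_resources() are all assumptions (struct job_record and job_resources_t are taken from the 17.11 slurmctld headers):

/*
 * Sketch only -- NOT the actual patch. The title says it prevents
 * job_resrcs from being overwritten for multi-partition submissions,
 * i.e. when the same job is evaluated against several partitions the
 * live allocation record must not be clobbered with a half-built
 * replacement (the NULL-bitmap struct seen in comment 16).
 */
static void set_job_resrcs_once(struct job_record *job_ptr,
				job_resources_t *new_resrcs)
{
	if (!job_ptr->job_resrcs)
		job_ptr->job_resrcs = new_resrcs;   /* set once */
	else
		free_job_resources(&new_resrcs);    /* keep the live record */
}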
Comment 34 Dominik Bartkiewicz 2018-07-30 03:17:57 MDT
Hi

This is fixed in commit:
https://github.com/SchedMD/slurm/commit/fef07a409724
I'm going to mark this as Resolved/Fixed; please feel free to re-open it if there's anything else we can help with.

Dominik
Comment 35 Marshall Garey 2018-08-02 08:43:23 MDT
*** Ticket 5447 has been marked as a duplicate of this ticket. ***
Comment 36 Marshall Garey 2018-08-21 10:12:57 MDT
*** Ticket 5438 has been marked as a duplicate of this ticket. ***
Comment 37 Marshall Garey 2018-09-12 09:05:03 MDT
*** Ticket 5675 has been marked as a duplicate of this ticket. ***