Description
Damien
2018-07-19 06:58:43 MDT
Hi,
Could you use gdb to generate a backtrace? E.g.:
gdb -batch -ex "thread apply all bt full" <core file>
Dominik

Created attachment 7354
Zipped slurmctld log
Created attachment 7355
core file
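
A minimal sketch of how the backtrace request could be scripted. The helper name `bt_cmd` is hypothetical and not part of gdb or Slurm; it only prints the gdb command line, passing the executable together with the core file, since gdb needs the executable to resolve symbols.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not a gdb or Slurm tool):
# print the gdb command that dumps a full backtrace of every thread
# for a given executable and core file.
bt_cmd() {
    exe=$1    # path to the binary that produced the core
    core=$2   # path to the core file
    if [ -z "$exe" ] || [ -z "$core" ]; then
        echo "usage: bt_cmd <executable> <core file>" >&2
        return 1
    fi
    printf 'gdb -batch -ex "thread apply all bt full" %s %s\n' "$exe" "$core"
}

# Example: show the command for one of the cores from this report.
bt_cmd /opt/slurm-17.11.4/sbin/slurmctld core.14684
```

The printed command is meant to be run on the slurmctld machine itself, where the binary and libraries match the core, with the output redirected to a file for attaching.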
Hi,
Without the matching binary and libraries, the core file is useless. You must generate the backtrace on the slurmctld machine; you can do it as in comment 1.
Dominik

There are two core files today:
[root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.14684
[New LWP 31599]
[New LWP 14686]
[New LWP 14823]
[New LWP 14689]
[New LWP 14830]
[New LWP 14684]
[New LWP 14685]
[New LWP 14687]
[New LWP 14690]
[New LWP 14696]
[New LWP 14744]
[New LWP 14745]
[New LWP 14818]
[New LWP 14832]
[New LWP 14828]
[New LWP 14829]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007f984fdae5f7 in ?? ()
"/mnt/slurm-logs/core.14684" is a core file.
Please specify an executable to debug.
Thread 16 (LWP 14829):
#0 0x00007f9850149e91 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 15 (LWP 14828):
#0 0x00007f984fe67413 in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x0000000000424f09 in ?? ()
No symbol table info available.
#3 0x0000000000000001 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 14 (LWP 14832):
#0 0x00007f98501466d5 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 13 (LWP 14818):
#0 0x00007f984fe36efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007f984fe36d94 in ?? ()
No symbol table info available.
#3 0x0000000000000030 in ?? ()
No symbol table info available.
#4 0x000000000b642851 in ?? ()
No symbol table info available.
#5 0x0000000000010000 in ?? ()
No symbol table info available.
#6 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 12 (LWP 14745):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 11 (LWP 14744):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 10 (LWP 14696):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 9 (LWP 14690):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 8 (LWP 14687):
#0 0x00007f9850143ef7 in ?? ()
No symbol table info available.
#1 0x00007f9850143e30 in ?? ()
No symbol table info available.
#2 0x00007f984d11ad28 in ?? ()
No symbol table info available.
#3 0x00007f984d019700 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 7 (LWP 14685):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000100e00000000 in ?? ()
No symbol table info available.
#2 0x00000000006dcfe0 in ?? ()
No symbol table info available.
#3 0x00000000006dd020 in ?? ()
No symbol table info available.
#4 0x0000000000001084 in ?? ()
No symbol table info available.
#5 0x00007f9848000930 in ?? ()
No symbol table info available.
#6 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 6 (LWP 14684):
#0 0x00007f984fe36efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007f984fe67b34 in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 5 (LWP 14830):
#0 0x00007f98501466d5 in ?? ()
No symbol table info available.
#1 0x0000026b00000000 in ?? ()
No symbol table info available.
#2 0x00000000006de260 in ?? ()
No symbol table info available.
#3 0x00000000006de2a0 in ?? ()
No symbol table info available.
#4 0x000000000000ccb0 in ?? ()
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 4 (LWP 14689):
#0 0x00007f9850146a82 in ?? ()
No symbol table info available.
#1 0x0000010500000000 in ?? ()
No symbol table info available.
#2 0x00007f985090cbc0 in ?? ()
No symbol table info available.
#3 0x00007f985090cc00 in ?? ()
No symbol table info available.
#4 0x0000000000000218 in ?? ()
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 3 (LWP 14823):
#0 0x00007f9850143ef7 in ?? ()
No symbol table info available.
#1 0x00007f9850143e30 in ?? ()
No symbol table info available.
#2 0x00007f98471d5d28 in ?? ()
No symbol table info available.
#3 0x00007f9846ed2700 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 2 (LWP 14686):
#0 0x00007f984fe36efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007f984fe36d94 in ?? ()
No symbol table info available.
#3 0x0000000000000003 in ?? ()
No symbol table info available.
#4 0x000000003785cb7a in ?? ()
No symbol table info available.
#5 0x0000000000010000 in ?? ()
No symbol table info available.
#6 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 1 (LWP 31599):
#0 0x00007f984fdae5f7 in ?? ()
No symbol table info available.
#1 0x00007f984fdafce8 in ?? ()
No symbol table info available.
#2 0x0000000000000020 in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
And:
[root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.32003
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007efc7b4605f7 in ?? ()
"/mnt/slurm-logs/core.32003" is a core file.
Please specify an executable to debug.
Thread 15 (LWP 32008):
#0 0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1 0x0002648a00000000 in ?? ()
No symbol table info available.
#2 0x00007efc7bfbebc0 in ?? ()
No symbol table info available.
#3 0x00007efc7bfbec00 in ?? ()
No symbol table info available.
#4 0x000000000002c2a3 in ?? ()
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 14 (LWP 32006):
#0 0x00007efc7b7f5ef7 in ?? ()
No symbol table info available.
#1 0x00007efc7b7f5e30 in ?? ()
No symbol table info available.
#2 0x00007efc787ccd28 in ?? ()
No symbol table info available.
#3 0x00007efc786cb700 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 13 (LWP 32005):
#0 0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007efc7b4e8d94 in ?? ()
No symbol table info available.
#3 0x0000000000000001 in ?? ()
No symbol table info available.
#4 0x000000001ac89147 in ?? ()
No symbol table info available.
#5 0x0000000000010000 in ?? ()
No symbol table info available.
#6 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 12 (LWP 32003):
#0 0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007efc7b519b34 in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 11 (LWP 32004):
#0 0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1 0x0005893000000000 in ?? ()
No symbol table info available.
#2 0x00000000006ddd00 in ?? ()
No symbol table info available.
#3 0x00000000006ddd40 in ?? ()
No symbol table info available.
#4 0x00000000001bf787 in ?? ()
No symbol table info available.
#5 0x0000000000101000 in ?? ()
No symbol table info available.
#6 0x0000000000464f20 in ?? ()
No symbol table info available.
#7 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 10 (LWP 32019):
#0 0x00007efc7b7fbe91 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 9 (LWP 32020):
#0 0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1 0x0002b50a00000000 in ?? ()
No symbol table info available.
#2 0x00000000006de260 in ?? ()
No symbol table info available.
#3 0x00000000006de2a0 in ?? ()
No symbol table info available.
#4 0x00000000014738ae in ?? ()
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 8 (LWP 32016):
#0 0x00007efc7b519413 in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x0000000000424f09 in ?? ()
No symbol table info available.
#3 0x0000000000000001 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 7 (LWP 32022):
#0 0x00007efc7b7f86d5 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 6 (LWP 32015):
#0 0x00007efc7b7f5ef7 in ?? ()
No symbol table info available.
#1 0x00007efc7b7f5e30 in ?? ()
No symbol table info available.
#2 0x00007efc726bad28 in ?? ()
No symbol table info available.
#3 0x00007efc725b9700 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 5 (LWP 32011):
#0 0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 4 (LWP 32013):
#0 0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 3 (LWP 32012):
#0 0x00007efc7b7f8a82 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 2 (LWP 32014):
#0 0x00007efc7b4e8efd in ?? ()
No symbol table info available.
#1 0x0000000000000002 in ?? ()
No symbol table info available.
#2 0x00007efc7b4e8d94 in ?? ()
No symbol table info available.
#3 0x0000000000000056 in ?? ()
No symbol table info available.
#4 0x000000001d8da290 in ?? ()
No symbol table info available.
#5 0x0000000000010000 in ?? ()
No symbol table info available.
#6 0x0000000000000000 in ?? ()
No symbol table info available.
Thread 1 (LWP 32010):
#0 0x00007efc7b4605f7 in ?? ()
No symbol table info available.
#1 0x00007efc7b461ce8 in ?? ()
No symbol table info available.
#2 0x0000000000000020 in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
Here is more info:
scontrol show config
Configuration data as of 2018-07-19T23:23:40
AccountingStorageBackupHost = m3-mgmt1
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = m3-mgmt2
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = m3-mgmt1
BackupController = m3-mgmt1
BatchStartTimeout = 10 sec
BOOT_TIME = 2018-07-19T18:11:26
BurstBufferType = (null)
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = m3
CompleteWait = 10 sec
ControlAddr = m3-mgmt2
ControlMachine = m3-mgmt2
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = Gres
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ALL
Epilog = /opt/slurm/etc/slurm.epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /opt/nhc-1.4.2/sbin/nhc
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 10 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 15000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = pmi2
MpiParams = ports=12000-12999
MsgAggregationParams = (null)
NEXT_JOB_ID = 2815470
NodeFeaturesPlugins = (null)
OverTimeLimit = 1 min
PluginDir = /opt/slurm-17.11.4/lib/slurm
PlugStackConfig = /opt/slurm-17.11.4/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = Yes
PriorityFlags = FAIR_TREE
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightFairShare = 80000
PriorityWeightJobSize = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS = 60000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = /opt/slurm/etc/slurm.prolog
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(497)
SlurmctldDebug = debug3
SlurmctldLogFile = /mnt/slurm-logs/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = quiet
SlurmctldTimeout = 300 sec
SlurmdDebug = debug5
SlurmdLogFile = /var/log/slurmd.log
SlurmdPidFile = /opt/slurm/var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /opt/slurm/var/spool
SlurmdSyslogDebug = quiet
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = /mnt/slurm-logs/slurmsched.log
SlurmSchedLogLevel = 9
SlurmctldPidFile = /opt/slurm/var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm-17.11.4/etc/slurm.conf
SLURM_VERSION = 17.11.4
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /opt/slurm/var/state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at m3-mgmt2/m3-mgmt1 are UP/UP

Kindly investigate.
Thanks
Damien

Hi,
Let’s try adding the binary path; maybe this will give us a proper backtrace:
gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.14684
Dominik

There you go:
gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.32003
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
Thread 15 (Thread 0x7efc785ca700 (LWP 32008)):
#0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007efc7bd4765f in _agent (x=<optimized out>) at slurmdbd_defs.c:1979
err = <optimized out>
cnt = <optimized out>
rc = <optimized out>
buffer = <optimized out>
abs_time = {tv_sec = 1531981566, tv_nsec = 0}
fail_time = 0
sigarray = {10, 0}
list_req = {msg_type = 1474, data = 0x7efc785c9ea0}
list_msg = {my_list = 0x0, return_code = 0}
__func__ = "_agent"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 14 (Thread 0x7efc786cb700 (LWP 32006)):
#0 0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007efc787cff6e in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
No locals.
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 13 (Thread 0x7efc787cc700 (LWP 32005)):
#0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007efc787d080e in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
local_job_list = <optimized out>
job_ptr = <optimized out>
itr = <optimized out>
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
__func__ = "_set_db_inx_thread"
#3 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 12 (Thread 0x7efc7c1d1740 (LWP 32003)):
#0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007efc7b519b34 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00000000004279f4 in _slurmctld_background (no_data=0x0) at controller.c:1767
i = 8
job_limit = <optimized out>
delta_t = 7
last_full_sched_time = 1531981512
last_ctld_bu_ping = 1531981415
last_uid_update = 1531980772
last_reboot_msg_time = 1531199546
ping_interval = 100
job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = READ_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_group_time = 1531981473
last_acct_gather_node_time = 1531199545
last_ext_sensors_time = 1531199545
last_resv_time = 1531981556
tv1 = {tv_sec = 1531981560, tv_usec = 209825}
node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_timelimit_time = 1531981536
last_assert_primary_time = 1531981358
purge_job_interval = 60
tv2 = {tv_sec = 1531981560, tv_usec = 209832}
config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_purge_job_time = 1531981512
last_node_acct = 1531981285
no_resp_msg_interval = <optimized out>
tv_str = "usec=7\000\000\064\066\000\067\000\000\000\000\000\000\000"
job_write_lock2 = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_no_resp_msg_time = 1531981560
now = <optimized out>
last_sched_time = 1531981554
last_ping_node_time = 1531981506
part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
last_health_check_time = 1531981363
last_checkpoint_time = 1531981452
last_ping_srun_time = 1531199545
last_trigger = 1531981546
#3 main (argc=<optimized out>, argv=<optimized out>) at controller.c:604
cnt = <optimized out>
error_code = <optimized out>
i = 3
stat_buf = {st_dev = 64769, st_ino = 143988, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392784, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1531188475, tv_nsec = 698018084}, st_mtim = {tv_sec = 1418762451, tv_nsec = 0}, st_ctim = {tv_sec = 1463116446, tv_nsec = 736207571}, __unused = {0, 0, 0}}
rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
callbacks = {acct_full = 0x4a93eb <trigger_primary_ctld_acct_full>, dbd_fail = 0x4a95fa <trigger_primary_dbd_fail>, dbd_resumed = 0x4a9688 <trigger_primary_dbd_res_op>, db_fail = 0x4a970d <trigger_primary_db_fail>, db_resumed = 0x4a979b <trigger_primary_db_res_op>}
create_clustername_file = 44
__func__ = "main"
Thread 11 (Thread 0x7efc7c1d0700 (LWP 32004)):
#0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x0000000000464f20 in _wr_wrlock (datatype=datatype@entry=JOB_LOCK) at locks.c:229
err = <optimized out>
__func__ = "_wr_wrlock"
#2 0x000000000046516c in lock_slurmctld (lock_levels=...) at locks.c:133
No locals.
#3 0x000000000041e1ac in _agent_retry (mail_too=false, min_wait=999) at agent.c:1381
agent_arg_ptr = 0x0
mi = 0x0
rc = <optimized out>
now = 1531981561
queued_req_ptr = 0x0
retry_iter = <optimized out>
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
#4 _agent_init (arg=<optimized out>) at agent.c:1326
min_wait = 999
mail_too = false
ts = {tv_sec = 1531981562, tv_nsec = 0}
__func__ = "_agent_init"
#5 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 10 (Thread 0x7efc721b5700 (LWP 32019)):
#0 0x00007efc7b7fbe91 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000042925c in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891
sig = 1
i = <optimized out>
rc = <optimized out>
sig_array = {2, 15, 1, 6, 12, 0}
set = {__val = {18467, 0 <repeats 15 times>}}
__func__ = "_slurmctld_signal_hand"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 9 (Thread 0x7efc720b4700 (LWP 32020)):
#0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000049ee9c in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204
err = <optimized out>
last_save = 1531981553
now = 1531981553
save_delay = <optimized out>
run_save = <optimized out>
save_count = 0
__func__ = "slurmctld_state_save"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 8 (Thread 0x7efc722b6700 (LWP 32016)):
#0 0x00007efc7b519413 in select () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000000000424f09 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026
max_fd = <optimized out>
newsockfd = <optimized out>
sockfd = 0x7efc5c000950
cli_addr = {sin_family = 2, sin_port = 35469, sin_addr = {s_addr = 4039643308}, sin_zero = "\000\000\000\000\000\000\000"}
srv_addr = {sin_family = 2, sin_port = 41242, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
port = 41242
ip = "0.0.0.0", '\000' <repeats 24 times>
fd_next = 0
i = <optimized out>
nports = 1
rfds = {__fds_bits = {8, 0 <repeats 15 times>}}
conn_arg = <optimized out>
config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
sigarray = {10, 0}
node_addr = <optimized out>
__func__ = "_slurmctld_rpc_mgr"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 7 (Thread 0x7efc71eb2700 (LWP 32022)):
#0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x0000000000423536 in _purge_files_thread (no_data=<optimized out>) at controller.c:3160
err = <optimized out>
job_id = 0x0
__func__ = "_purge_files_thread"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 6 (Thread 0x7efc725b9700 (LWP 32015)):
#0 0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007efc729bfe75 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1453
No locals.
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 5 (Thread 0x7efc72eca700 (LWP 32011)):
#0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000043543a in _heartbeat_thread (no_data=<optimized out>) at heartbeat.c:130
err = <optimized out>
beat = 30
now = <optimized out>
nl = 16730398017500217344
ts = {tv_sec = 1531981574, tv_nsec = 0}
reg_file = 0x0
new_file = 0x0
fd = 7
__func__ = "_heartbeat_thread"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 4 (Thread 0x7efc72cc8700 (LWP 32013)):
#0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000043060d in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161
err = <optimized out>
ts = {tv_sec = 1531981562, tv_nsec = 0}
job_update_info = <optimized out>
__func__ = "_fed_job_update_thread"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 3 (Thread 0x7efc72dc9700 (LWP 32012)):
#0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000042c963 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203
err = <optimized out>
cluster = <optimized out>
ts = {tv_sec = 1531981562, tv_nsec = 0}
cluster_iter = <optimized out>
rpc_iter = <optimized out>
rpc_rec = <optimized out>
req_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
resp_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
ctld_req_msg = {my_list = 0x0}
success_bits = <optimized out>
rc = <optimized out>
resp_inx = <optimized out>
success_size = <optimized out>
fed_read_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = READ_LOCK}
__func__ = "_agent_thread"
#2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 2 (Thread 0x7efc726ba700 (LWP 32014)):
#0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007efc729c2596 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
start_time = 1531981347
last_reset = 1469517764
next_reset = 0
calc_period = 300
decay_hl = <optimized out>
reset_period = 0
now = 1531981347
run_delta = <optimized out>
real_decay = <optimized out>
elapsed = <optimized out>
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK, federation = NO_LOCK}
locks = {assoc = WRITE_LOCK, file = NO_LOCK, qos = NO_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
__func__ = "_decay_thread"
#3 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 1 (Thread 0x7efc735ed700 (LWP 32010)):
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007efc7b461ce8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007efc7b459566 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007efc7b459612 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x00007efc7bc72a21 in bit_ffs (b=<optimized out>) at bitstring.c:475
bit = 0
value = -1
__PRETTY_FUNCTION__ = "bit_ffs"
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
i = 0
j = 0
num_jobs = 8
size = <optimized out>
x = 0
this_row = <optimized out>
orig_row = 0x7efc4c8c60b0
ss = 0x7efc4c7511a0
__func__ = "_build_row_bitmaps"
#6 0x00007efc79dfc053 in _rm_job_from_res (part_record_ptr=part_record_ptr@entry=0x7efc4c0142c0, node_usage=node_usage@entry=0x7efc4c209940, job_ptr=job_ptr@entry=0x7efc44446bd0, action=action@entry=0) at select_cons_res.c:1294
p_ptr = 0x7efc4c892110
job = 0x7efc4c643a20
node_ptr = <optimized out>
first_bit = 0
last_bit = <optimized out>
i = <optimized out>
n = <optimized out>
gres_list = <optimized out>
__func__ = "_rm_job_from_res"
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
first_job_ptr = 0x7efc44446bd0
next_job_ptr = <optimized out>
overlap = <optimized out>
last_job_ptr = 0x7efc44446bd0
rm_job_cnt = 0
tv1 = {tv_sec = 1531981561, tv_usec = 105137}
tv_str = '\000' <repeats 19 times>
delta_t = 139621943865360
time_window = 30
more_jobs = true
tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816}
cr_job_list = 0x7efc44b91fb0
tmp_cr_type = 20
future_part = 0x7efc4c0142c0
tmp_job_ptr = 0x7efc44446bd0
preemptee_iterator = <optimized out>
orig_map = 0x7efc4c014300
qos_preemptor = false
future_usage = 0x7efc4c209940
job_iterator = 0x7efc74000990
action = <optimized out>
rc = -1
now = 1531981561
#8 select_p_job_test (job_ptr=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x7efc735ecab8, exc_core_bitmap=0x0) at select_cons_res.c:2310
rc = 22
debug_cpu_bind = false
debug_check = true
#9 0x00007efc7bca9a3c in select_g_job_test (job_ptr=job_ptr@entry=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7efc735ecab8, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:582
No locals.
#10 0x00007efc735f2f39 in _try_sched (job_ptr=job_ptr@entry=0x7efc4c0e5e90, avail_bitmap=avail_bitmap@entry=0x7efc735ecdf8, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:482
orig_shared = 254
now = 1531981561
str = "\300\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\070\000\000\000\000\000\000\000(\000\000\000\000\000\000\000\340\063\326{\374~\000\000\247\000\000\000\000\000\000\000e9\326{\374~\000\000\305\065P[\000\000\000\000\301I\325{\374~\000\000\240\316^s\374~\000\000\240\000\000\000\000\000\000\000\240\316^s"
low_bitmap = 0x0
tmp_bitmap = 0x7efc4c014280
rc = 0
has_xor = false
feat_cnt = 0
detail_ptr = <optimized out>
preemptee_candidates = 0x0
preemptee_job_list = 0x0
feat_iter = <optimized out>
feat_ptr = <optimized out>
__func__ = "_try_sched"
#11 0x00007efc735f5677 in _attempt_backfill () at backfill.c:1894
bf_job_id = <optimized out>
bf_array_task_id = <optimized out>
bf_job_priority = <optimized out>
tv1 = {tv_sec = 1531981560, tv_usec = 876382}
tv2 = {tv_sec = 0, tv_usec = 139622873694297}
tv_str = '\000' <repeats 19 times>
delta_t = 139622873694297
job_queue = <optimized out>
job_queue_rec = 0x0
bb = <optimized out>
i = <optimized out>
j = <optimized out>
k = <optimized out>
node_space_recs = <optimized out>
mcs_select = <optimized out>
qos_ptr = <optimized out>
job_ptr = 0x7efc4c0e5e90
part_ptr = <optimized out>
bf_part_ptr = 0x0
end_time = 1531983301
end_reserve = <optimized out>
deadline_time_limit = <optimized out>
boot_time = 0
orig_end_time = <optimized out>
time_limit = <optimized out>
comp_time_limit = <optimized out>
orig_time_limit = <optimized out>
part_time_limit = <optimized out>
min_nodes = 1
max_nodes = 1
req_nodes = 1
active_bitmap = 0x0
avail_bitmap = 0x7efc4c0feaa0
exc_core_bitmap = 0x0
resv_bitmap = 0x7efc4c00da50
now = 1531981561
sched_start = <optimized out>
later_start = 0
start_res = 1531981561
resv_end = <optimized out>
window_end = <optimized out>
orig_sched_start = <optimized out>
orig_start_time = <optimized out>
node_space = 0x7efc4c0f26a0
bf_user_part_ptr = 0x0
bf_time1 = {tv_sec = 1531981560, tv_usec = 877289}
bf_time2 = {tv_sec = 1531981530, tv_usec = 876177}
rc = 0
error_code = <optimized out>
job_test_count = <optimized out>
test_time_count = <optimized out>
pend_time = <optimized out>
uid = 0x0
nuser = <optimized out>
bf_parts = <optimized out>
bf_part_jobs = 0x0
bf_part_resv = 0x0
njobs = 0x0
already_counted = true
reject_array_job_id = <optimized out>
reject_array_part = <optimized out>
job_start_cnt = <optimized out>
start_time = <optimized out>
config_update = <optimized out>
part_update = <optimized out>
start_tv = {tv_sec = 1531981560, tv_usec = 876400}
test_array_job_id = <optimized out>
test_array_count = <optimized out>
job_no_reserve = <optimized out>
resv_overlap = true
save_share_res = <optimized out>
save_whole_node = <optimized out>
test_fini = -1
user_part_inx1 = <optimized out>
user_part_inx2 = <optimized out>
part_inx = <optimized out>
user_inx = <optimized out>
qos_flags = <optimized out>
qos_blocked_until = <optimized out>
qos_part_blocked_until = <optimized out>
qos_read_lock = {assoc = NO_LOCK, file = NO_LOCK, qos = READ_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
__func__ = "_attempt_backfill"
#12 0x00007efc735f7bc0 in backfill_agent (args=<optimized out>) at backfill.c:904
now = <optimized out>
wait_time = <optimized out>
last_backfill_time = 1531981530
all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
load_config = <optimized out>
short_sleep = <optimized out>
backfill_cnt = 23555
__func__ = "backfill_agent"
#13 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007efc7b521ced in clone () from /lib64/libc.so.6
No symbol table info available.
Thanks
Damien
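
The time_t locals in the backtrace above (for example now = 1531981561 in frame 7 and last_backfill_time = 1531981530 in frame 12) are Unix epoch seconds. A minimal sketch for converting them to UTC, which may help when lining the abort up with the attached slurmctld log. The helper name epoch_to_utc is hypothetical and not part of Slurm or gdb.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> str:
    """Format Unix epoch seconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


# Values copied from the backtrace above.
now = 1531981561                 # frame 7, _will_run_test
last_backfill_time = 1531981530  # frame 12, backfill_agent

print(epoch_to_utc(now))
print("seconds since last backfill cycle:", now - last_backfill_time)
```

The result is in UTC; the site's local time zone is not stated in the thread, so any local-time conversion would be an assumption.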
Hi

Could you run gdb interactively on core.32003 and run these commands?
t 1
f 7
p tmp_job_ptr
p rm_job_cnt

Dominik

Hi Dominik

I’m not familiar with gdb. Can you give me more details?

Thanks
Damien

On Friday, 20 July 2018, <bugs@schedmd.com> wrote:
> Comment # 9 <https://bugs.schedmd.com/show_bug.cgi?id=5452#c9> on bug 5452 <https://bugs.schedmd.com/show_bug.cgi?id=5452> from Dominik Bartkiewicz <bart@schedmd.com>
>
> Hi
>
> Could you run gdb interactively on core.32003 and run these commands?
> t 1
> f 7
> p tmp_job_ptr
> p rm_job_cnt
>
> Dominik

Hi

Of course. Start gdb with the slurmctld binary and the core file:
gdb /opt/slurm-17.11.4/sbin/slurmctld core.32003
Then you should see a prompt like this: "(gdb)"
Go to thread 1:
t 1
Then pick frame 7:
f 7
And you can print some values:
p tmp_job_ptr
p rm_job_cnt

Dominik

Hi

Let me know if this is clear.
Could you send me the value of ss[x].tmpjobs? E.g.:
thread 1
frame 5
print ss[x].tmpjobs

Dominik

Hi Dominik
Here are the values:
tmp]# gdb /opt/slurm-17.11.4/sbin/slurmctld core.32003
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm-17.11.4/sbin/slurmctld...done.
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 sssd-client-1.13.0-40.el7_2.12.x86_64
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) f 7
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931 select_cons_res.c: No such file or directory.
(gdb) p tmp_job_ptr
$1 = (struct job_record *) 0x7efc44446bd0
(gdb) p rm_job_cnt
$2 = 0
(gdb)

Hi Dominik
Here are the extras:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0,
job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931 in select_cons_res.c
(gdb) frame 5
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 in select_cons_res.c
(gdb) print ss[x].tmpjobs
$7 = (struct job_resources *) 0x7efc44660740
(gdb)
I hope that this is sufficient, else please let us know.
Many Thanks
Damien
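
For anyone who finds the interactive session awkward, the same requests can be run non-interactively with the -batch and -ex options already used earlier in this ticket. A minimal sketch that only builds the command line; the helper name gdb_batch_argv is hypothetical, and the binary and core paths are the ones quoted in this thread.

```python
import shlex


def gdb_batch_argv(binary, core, commands):
    """Build an argv list that runs the given gdb commands in batch mode."""
    argv = ["gdb", "-batch"]
    for cmd in commands:
        # Each -ex runs one gdb command, in order, after the files are loaded.
        argv += ["-ex", cmd]
    argv += [binary, core]
    return argv


argv = gdb_batch_argv(
    "/opt/slurm-17.11.4/sbin/slurmctld",
    "core.32003",
    [
        "thread 1",
        "frame 7",
        "print tmp_job_ptr",
        "print rm_job_cnt",
        "frame 5",
        "print ss[x].tmpjobs",
    ],
)

# Print a shell-quoted command that can be pasted on the slurmctld host.
print(shlex.join(argv))
```

The list can also be passed directly to subprocess.run on the machine that has the matching slurmctld binary and libraries, since the core file is not useful without them.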
Hi

Thank you, I appreciate your efforts and patience.
I should have asked for this earlier. Could you attach the output of these commands?
thread 1
frame 5
info locals
print *(ss[x].tmpjobs)
t 1
f 7
info locals
print *tmp_job_ptr

Dominik

Hi Dominik
Here you go:
(gdb)
(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) frame 5
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 select_cons_res.c: No such file or directory.
(gdb) info locals
i = 0
j = 0
num_jobs = 8
size = <optimized out>
x = 0
this_row = <optimized out>
orig_row = 0x7efc4c8c60b0
ss = 0x7efc4c7511a0
__func__ = "_build_row_bitmaps"
(gdb) print *(ss[x].tmpjobs)
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0,
cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1,
sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb)
$2 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0,
cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1,
sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb)
$3 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0,
cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1,
sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 in select_cons_res.c
(gdb) f 5
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 in select_cons_res.c
(gdb) f 7
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0,
job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931 in select_cons_res.c
(gdb) info locals
first_job_ptr = 0x7efc44446bd0
next_job_ptr = <optimized out>
overlap = <optimized out>
last_job_ptr = 0x7efc44446bd0
rm_job_cnt = 0
tv1 = {tv_sec = 1531981561, tv_usec = 105137}
tv_str = '\000' <repeats 19 times>
delta_t = 139621943865360
time_window = 30
more_jobs = true
tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816}
cr_job_list = 0x7efc44b91fb0
tmp_cr_type = 20
future_part = 0x7efc4c0142c0
tmp_job_ptr = 0x7efc44446bd0
preemptee_iterator = <optimized out>
orig_map = 0x7efc4c014300
qos_preemptor = false
future_usage = 0x7efc4c209940
job_iterator = 0x7efc74000990
action = <optimized out>
rc = -1
now = 1531981561
(gdb) print *tmp_job_ptr
$4 = {account = 0x7efc4423e7c0 "ax22", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7efc446f8000 "m3-login2", alloc_resp_port = 0,
alloc_sid = 667, array_job_id = 2814050, array_task_id = 1, array_recs = 0x0, assoc_id = 1337, assoc_ptr = 0x111d1c0, batch_flag = 1,
batch_host = 0x7efc4c22a170 "m3a000", billable_tres = 6, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0,
ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 6, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0,
derived_ec = 0, details = 0x7efc442249f0, direct_set_prio = 0, end_time = 1531983301, end_time_exp = 1531983301, epilog_running = false,
exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7efc4c40b0d0 "",
gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7efc4c2653c0 "", gres_used = 0x0, group_id = 10025, job_id = 2814051, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7efc4c643a20, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1531981561, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7efc4414c940}, mail_type = 0,
mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7efc446f7fd0 "seecr19july", network = 0x0, next_step_id = 0, ngids = 0,
nodes = 0x7efc4c2653a0 "m3a000", node_addr = 0x7efc4c119940, node_bitmap = 0x7efc4c148520, node_bitmap_cg = 0x0, node_cnt = 1,
node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0,
pack_job_list = 0x0, partition = 0x7efc444ba010 "short", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7efc44863820,
power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 73105, priority_array = 0x0,
prio_factors = 0x7efc446f7f40, profile = 4294967295, qos_id = 1, qos_ptr = 0x10a3920, qos_blocking_ptr = 0x0, reboot = 0 '\000',
restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0,
select_jobinfo = 0x7efc4489c500, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1531981561,
state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x7efc449009c0, suspend_time = 0, time_last_active = 1531981561,
time_limit = 29, time_min = 0, tot_sus_time = 0, total_cpus = 6, total_nodes = 1, tres_req_cnt = 0x7efc44c1d9b0,
tres_req_str = 0x7efc445ecca0 "1=6,2=8000,4=1", tres_fmt_req_str = 0x7efc4428bd50 "cpu=6,mem=8000M,node=1", tres_alloc_cnt = 0x7efc4c645f00,
tres_alloc_str = 0x7efc4c0bd590 "1=6,2=8000,3=18446744073709551614,4=1,5=6",
tres_fmt_alloc_str = 0x7efc4c40b040 "cpu=6,mem=8000M,node=1,billing=6", user_id = 11014, user_name = 0x0, wait_all_nodes = 0, warn_flags = 0,
warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)
I hope that you find the problem.
Many Thanks.
Damien
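
In the print *(ss[x].tmpjobs) output above, the scalar counters are populated while the pointer members are printed as 0x0, including node_bitmap and core_bitmap. Whether that is what trips the assertion in bit_ffs (frame 4) is not stated in this ticket, so treat that reading as an assumption. A minimal sketch for listing the null pointer members of a pasted gdb struct dump; the helper name null_pointer_fields is hypothetical.

```python
import re


def null_pointer_fields(gdb_struct_text):
    """Return the names of members printed as exactly '= 0x0' in a gdb dump."""
    # The lookahead avoids matching longer values such as 0x0abc.
    return re.findall(r"(\w+) = 0x0(?![0-9a-fA-F])", gdb_struct_text)


# Raw string, because the dump contains the literal text '\000'.
dump = r"""
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0,
cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1,
sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
"""

fields = null_pointer_fields(dump)
print(len(fields), "null pointer members:")
for name in fields:
    print(" ", name)
```

The same function can be pointed at the print *tmp_job_ptr output to separate members that are legitimately unset (for example resv_ptr) from ones that look suspicious.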
Hi

We are still investigating this issue.
Does this still occur?

Dominik

Hi Dominik

It has not reappeared since, but it crashed twice last Thursday night, and once about a month ago.
We are looking into whether there is a preventative measure that we can use, or whether it is a CPU-load issue or a configuration problem.

Cheers
Damien

Hi

This patch should fix this issue.
It hasn't been committed yet, but we think it will be soon, in this or a similar form.

Dominik

Hi

This is fixed in commit:
https://github.com/SchedMD/slurm/commit/fef07a409724
I'm going to go ahead and mark this as Resolved/Fixed. Please feel free to re-open this if there's anything else we can help with.

Dominik

*** Ticket 5447 has been marked as a duplicate of this ticket. ***
*** Ticket 5438 has been marked as a duplicate of this ticket. ***
*** Ticket 5675 has been marked as a duplicate of this ticket. ***
|