Description
Damien
2018-07-19 06:58:43 MDT
Hi,

Could you use gdb to generate a backtrace? E.g.:

gdb -batch -ex "thread apply all bt full" <core file>

Dominik

Created attachment 7354
Zipped slurmctld log

Created attachment 7355
core file
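For repeated use, the gdb call requested above can be assembled by a small script. The following is only a sketch: the function name `gdb_backtrace_cmd` is hypothetical, and the optional executable argument relies on the standard `gdb [options] program core` invocation rather than on anything stated so far in this report.

```python
import shlex
from typing import List, Optional


def gdb_backtrace_cmd(core: str, executable: Optional[str] = None) -> List[str]:
    """Build the argv for a batch-mode 'bt full' of every thread in a core file.

    `executable` is optional; when given, it is placed before the core file,
    which is the usual gdb argument order.
    """
    cmd = ["gdb", "-batch", "-ex", "thread apply all bt full"]
    if executable:
        cmd.append(executable)
    cmd.append(core)
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable shell command; the core file name is a placeholder.
    print(shlex.join(gdb_backtrace_cmd("core.12345")))
```

The list form can be passed directly to `subprocess.run` on the machine where slurmctld and its libraries are installed.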
Hi,

Without all the binaries and libraries, the core file is useless. You must generate the backtrace on the slurmctld machine; you can do it as in comment 1.

Dominik

There are two core files today:

[root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.14684
[New LWP 31599] [New LWP 14686] [New LWP 14823] [New LWP 14689] [New LWP 14830] [New LWP 14684] [New LWP 14685] [New LWP 14687] [New LWP 14690] [New LWP 14696] [New LWP 14744] [New LWP 14745] [New LWP 14818] [New LWP 14832] [New LWP 14828] [New LWP 14829] Missing separate debuginfo for the main executable file Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4 Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'. Program terminated with signal 6, Aborted. #0 0x00007f984fdae5f7 in ?? () "/mnt/slurm-logs/core.14684" is a core file. Please specify an executable to debug. Thread 16 (LWP 14829): #0 0x00007f9850149e91 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 15 (LWP 14828): #0 0x00007f984fe67413 in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x0000000000424f09 in ?? () No symbol table info available. #3 0x0000000000000001 in ?? () No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 14 (LWP 14832): #0 0x00007f98501466d5 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 13 (LWP 14818): #0 0x00007f984fe36efd in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007f984fe36d94 in ?? () No symbol table info available. #3 0x0000000000000030 in ?? () No symbol table info available. #4 0x000000000b642851 in ?? () No symbol table info available. #5 0x0000000000010000 in ?? () No symbol table info available. #6 0x0000000000000000 in ?? () No symbol table info available.
Thread 12 (LWP 14745): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 11 (LWP 14744): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 10 (LWP 14696): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 9 (LWP 14690): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 8 (LWP 14687): #0 0x00007f9850143ef7 in ?? () No symbol table info available. #1 0x00007f9850143e30 in ?? () No symbol table info available. #2 0x00007f984d11ad28 in ?? () No symbol table info available. #3 0x00007f984d019700 in ?? () No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 7 (LWP 14685): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000100e00000000 in ?? () No symbol table info available. #2 0x00000000006dcfe0 in ?? () No symbol table info available. #3 0x00000000006dd020 in ?? () No symbol table info available. #4 0x0000000000001084 in ?? () No symbol table info available. #5 0x00007f9848000930 in ?? () No symbol table info available. #6 0x0000000000000000 in ?? () No symbol table info available. Thread 6 (LWP 14684): #0 0x00007f984fe36efd in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007f984fe67b34 in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. Thread 5 (LWP 14830): #0 0x00007f98501466d5 in ?? () No symbol table info available. #1 0x0000026b00000000 in ?? () No symbol table info available. #2 0x00000000006de260 in ?? () No symbol table info available. #3 0x00000000006de2a0 in ?? () No symbol table info available. #4 0x000000000000ccb0 in ?? 
() No symbol table info available. #5 0x0000000000000000 in ?? () No symbol table info available. Thread 4 (LWP 14689): #0 0x00007f9850146a82 in ?? () No symbol table info available. #1 0x0000010500000000 in ?? () No symbol table info available. #2 0x00007f985090cbc0 in ?? () No symbol table info available. #3 0x00007f985090cc00 in ?? () No symbol table info available. #4 0x0000000000000218 in ?? () No symbol table info available. #5 0x0000000000000000 in ?? () No symbol table info available. Thread 3 (LWP 14823): #0 0x00007f9850143ef7 in ?? () No symbol table info available. #1 0x00007f9850143e30 in ?? () No symbol table info available. #2 0x00007f98471d5d28 in ?? () No symbol table info available. #3 0x00007f9846ed2700 in ?? () No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 2 (LWP 14686): #0 0x00007f984fe36efd in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007f984fe36d94 in ?? () No symbol table info available. #3 0x0000000000000003 in ?? () No symbol table info available. #4 0x000000003785cb7a in ?? () No symbol table info available. #5 0x0000000000010000 in ?? () No symbol table info available. #6 0x0000000000000000 in ?? () No symbol table info available. Thread 1 (LWP 31599): #0 0x00007f984fdae5f7 in ?? () No symbol table info available. #1 0x00007f984fdafce8 in ?? () No symbol table info available. #2 0x0000000000000020 in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. 
And [root@m3-mgmt2 slurm-logs]# gdb -batch -ex "thread apply all bt full" core.32003 [New LWP 32010] [New LWP 32014] [New LWP 32012] [New LWP 32013] [New LWP 32011] [New LWP 32015] [New LWP 32022] [New LWP 32016] [New LWP 32020] [New LWP 32019] [New LWP 32004] [New LWP 32003] [New LWP 32005] [New LWP 32006] [New LWP 32008] Missing separate debuginfo for the main executable file Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/71/a7c60e3a83c09c01aec6f05752aef7f4e632e4 Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'. Program terminated with signal 6, Aborted. #0 0x00007efc7b4605f7 in ?? () "/mnt/slurm-logs/core.32003" is a core file. Please specify an executable to debug. Thread 15 (LWP 32008): #0 0x00007efc7b7f8a82 in ?? () No symbol table info available. #1 0x0002648a00000000 in ?? () No symbol table info available. #2 0x00007efc7bfbebc0 in ?? () No symbol table info available. #3 0x00007efc7bfbec00 in ?? () No symbol table info available. #4 0x000000000002c2a3 in ?? () No symbol table info available. #5 0x0000000000000000 in ?? () No symbol table info available. Thread 14 (LWP 32006): #0 0x00007efc7b7f5ef7 in ?? () No symbol table info available. #1 0x00007efc7b7f5e30 in ?? () No symbol table info available. #2 0x00007efc787ccd28 in ?? () No symbol table info available. #3 0x00007efc786cb700 in ?? () No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 13 (LWP 32005): #0 0x00007efc7b4e8efd in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007efc7b4e8d94 in ?? () No symbol table info available. #3 0x0000000000000001 in ?? () No symbol table info available. #4 0x000000001ac89147 in ?? () No symbol table info available. #5 0x0000000000010000 in ?? () No symbol table info available. #6 0x0000000000000000 in ?? () No symbol table info available. Thread 12 (LWP 32003): #0 0x00007efc7b4e8efd in ?? () No symbol table info available. 
#1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007efc7b519b34 in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. Thread 11 (LWP 32004): #0 0x00007efc7b7f86d5 in ?? () No symbol table info available. #1 0x0005893000000000 in ?? () No symbol table info available. #2 0x00000000006ddd00 in ?? () No symbol table info available. #3 0x00000000006ddd40 in ?? () No symbol table info available. #4 0x00000000001bf787 in ?? () No symbol table info available. #5 0x0000000000101000 in ?? () No symbol table info available. #6 0x0000000000464f20 in ?? () No symbol table info available. #7 0x0000000000000000 in ?? () No symbol table info available. Thread 10 (LWP 32019): #0 0x00007efc7b7fbe91 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 9 (LWP 32020): #0 0x00007efc7b7f86d5 in ?? () No symbol table info available. #1 0x0002b50a00000000 in ?? () No symbol table info available. #2 0x00000000006de260 in ?? () No symbol table info available. #3 0x00000000006de2a0 in ?? () No symbol table info available. #4 0x00000000014738ae in ?? () No symbol table info available. #5 0x0000000000000000 in ?? () No symbol table info available. Thread 8 (LWP 32016): #0 0x00007efc7b519413 in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x0000000000424f09 in ?? () No symbol table info available. #3 0x0000000000000001 in ?? () No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 7 (LWP 32022): #0 0x00007efc7b7f86d5 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 6 (LWP 32015): #0 0x00007efc7b7f5ef7 in ?? () No symbol table info available. #1 0x00007efc7b7f5e30 in ?? () No symbol table info available. #2 0x00007efc726bad28 in ?? () No symbol table info available. #3 0x00007efc725b9700 in ?? 
() No symbol table info available. #4 0x0000000000000000 in ?? () No symbol table info available. Thread 5 (LWP 32011): #0 0x00007efc7b7f8a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 4 (LWP 32013): #0 0x00007efc7b7f8a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 3 (LWP 32012): #0 0x00007efc7b7f8a82 in ?? () No symbol table info available. #1 0x0000000000000000 in ?? () No symbol table info available. Thread 2 (LWP 32014): #0 0x00007efc7b4e8efd in ?? () No symbol table info available. #1 0x0000000000000002 in ?? () No symbol table info available. #2 0x00007efc7b4e8d94 in ?? () No symbol table info available. #3 0x0000000000000056 in ?? () No symbol table info available. #4 0x000000001d8da290 in ?? () No symbol table info available. #5 0x0000000000010000 in ?? () No symbol table info available. #6 0x0000000000000000 in ?? () No symbol table info available. Thread 1 (LWP 32010): #0 0x00007efc7b4605f7 in ?? () No symbol table info available. #1 0x00007efc7b461ce8 in ?? () No symbol table info available. #2 0x0000000000000020 in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. 
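A backtrace in which every frame is `?? ()` means gdb could not resolve any symbols, as the "Please specify an executable to debug" message in the output above indicates. A minimal sketch of an automated check follows; the function name `unresolved_ratio` and the regular expression are illustrative assumptions, not part of gdb or Slurm.

```python
import re

# Matches gdb frame headers such as
#   "#0 0x00007f984fdae5f7 in ?? ("   or   "#3 main ("
# and captures the function name ("??" when gdb has no symbol for the frame).
FRAME_RE = re.compile(r"#\d+\s+(?:0x[0-9a-f]+ in )?(\S+) \(")


def unresolved_ratio(backtrace: str) -> float:
    """Return the fraction of frames whose function name is '??'."""
    names = FRAME_RE.findall(backtrace)
    if not names:
        return 0.0
    return sum(1 for name in names if name == "??") / len(names)


if __name__ == "__main__":
    # Made-up two-frame sample in the same shape as the output above.
    sample = (
        "Thread 1 (LWP 1): #0 0x00007f0000000001 in ?? () "
        "No symbol table info available. "
        "#1 0x0000000000000000 in ?? () No symbol table info available."
    )
    if unresolved_ratio(sample) > 0.9:
        print("Backtrace has no symbols; rerun gdb with the executable path.")
```

A ratio close to 1.0 suggests rerunning gdb with the executable that produced the core, on a host that has the same libraries installed.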
Here is more info:

scontrol show config
Configuration data as of 2018-07-19T23:23:40
AccountingStorageBackupHost = m3-mgmt1
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = m3-mgmt2
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = m3-mgmt1
BackupController = m3-mgmt1
BatchStartTimeout = 10 sec
BOOT_TIME = 2018-07-19T18:11:26
BurstBufferType = (null)
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = m3
CompleteWait = 10 sec
ControlAddr = m3-mgmt2
ControlMachine = m3-mgmt2
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = Gres
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ALL
Epilog = /opt/slurm/etc/slurm.epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /opt/nhc-1.4.2/sbin/nhc
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 10 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 15000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = pmi2
MpiParams = ports=12000-12999
MsgAggregationParams = (null)
NEXT_JOB_ID = 2815470
NodeFeaturesPlugins = (null)
OverTimeLimit = 1 min
PluginDir = /opt/slurm-17.11.4/lib/slurm
PlugStackConfig = /opt/slurm-17.11.4/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = Yes
PriorityFlags = FAIR_TREE
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightFairShare = 80000
PriorityWeightJobSize = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS = 60000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = /opt/slurm/etc/slurm.prolog
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(497)
SlurmctldDebug = debug3
SlurmctldLogFile = /mnt/slurm-logs/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = quiet
SlurmctldTimeout = 300 sec
SlurmdDebug = debug5
SlurmdLogFile = /var/log/slurmd.log
SlurmdPidFile = /opt/slurm/var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /opt/slurm/var/spool
SlurmdSyslogDebug = quiet
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = /mnt/slurm-logs/slurmsched.log
SlurmSchedLogLevel = 9
SlurmctldPidFile = /opt/slurm/var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm-17.11.4/etc/slurm.conf
SLURM_VERSION = 17.11.4
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /opt/slurm/var/state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at m3-mgmt2/m3-mgmt1 are UP/UP

Kindly investigate.
Thanks
Damien

Hi,

Let's try adding the binary path; maybe this will give us a proper backtrace:

gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.14684

Dominik

There you go:

gdb -batch -ex "thread apply all bt full" /opt/slurm-17.11.4/sbin/slurmctld core.32003
[New LWP 32010] [New LWP 32014] [New LWP 32012] [New LWP 32013] [New LWP 32011] [New LWP 32015] [New LWP 32022] [New LWP 32016] [New LWP 32020] [New LWP 32019] [New LWP 32004] [New LWP 32003] [New LWP 32005] [New LWP 32006] [New LWP 32008] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'. Program terminated with signal 6, Aborted. #0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6 Thread 15 (Thread 0x7efc785ca700 (LWP 32008)): #0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007efc7bd4765f in _agent (x=<optimized out>) at slurmdbd_defs.c:1979 err = <optimized out> cnt = <optimized out> rc = <optimized out> buffer = <optimized out> abs_time = {tv_sec = 1531981566, tv_nsec = 0} fail_time = 0 sigarray = {10, 0} list_req = {msg_type = 1474, data = 0x7efc785c9ea0} list_msg = {my_list = 0x0, return_code = 0} __func__ = "_agent" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 14 (Thread 0x7efc786cb700 (LWP 32006)): #0 0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007efc787cff6e in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445 No locals. #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available.
Thread 13 (Thread 0x7efc787cc700 (LWP 32005)): #0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6 No symbol table info available. #2 0x00007efc787d080e in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437 local_job_list = <optimized out> job_ptr = <optimized out> itr = <optimized out> job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} __func__ = "_set_db_inx_thread" #3 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 12 (Thread 0x7efc7c1d1740 (LWP 32003)): #0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00007efc7b519b34 in usleep () from /lib64/libc.so.6 No symbol table info available. 
#2 0x00000000004279f4 in _slurmctld_background (no_data=0x0) at controller.c:1767 i = 8 job_limit = <optimized out> delta_t = 7 last_full_sched_time = 1531981512 last_ctld_bu_ping = 1531981415 last_uid_update = 1531980772 last_reboot_msg_time = 1531199546 ping_interval = 100 job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK} job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = READ_LOCK, partition = NO_LOCK, federation = NO_LOCK} last_group_time = 1531981473 last_acct_gather_node_time = 1531199545 last_ext_sensors_time = 1531199545 last_resv_time = 1531981556 tv1 = {tv_sec = 1531981560, tv_usec = 209825} node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK} last_timelimit_time = 1531981536 last_assert_primary_time = 1531981358 purge_job_interval = 60 tv2 = {tv_sec = 1531981560, tv_usec = 209832} config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK} last_purge_job_time = 1531981512 last_node_acct = 1531981285 no_resp_msg_interval = <optimized out> tv_str = "usec=7\000\000\064\066\000\067\000\000\000\000\000\000\000" job_write_lock2 = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} last_no_resp_msg_time = 1531981560 now = <optimized out> last_sched_time = 1531981554 last_ping_node_time = 1531981506 part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = WRITE_LOCK, federation = NO_LOCK} last_health_check_time = 1531981363 last_checkpoint_time = 1531981452 last_ping_srun_time = 1531199545 last_trigger = 1531981546 #3 main (argc=<optimized out>, argv=<optimized out>) at 
controller.c:604 cnt = <optimized out> error_code = <optimized out> i = 3 stat_buf = {st_dev = 64769, st_ino = 143988, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392784, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1531188475, tv_nsec = 698018084}, st_mtim = {tv_sec = 1418762451, tv_nsec = 0}, st_ctim = {tv_sec = 1463116446, tv_nsec = 736207571}, __unused = {0, 0, 0}} rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615} config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK} node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK} callbacks = {acct_full = 0x4a93eb <trigger_primary_ctld_acct_full>, dbd_fail = 0x4a95fa <trigger_primary_dbd_fail>, dbd_resumed = 0x4a9688 <trigger_primary_dbd_res_op>, db_fail = 0x4a970d <trigger_primary_db_fail>, db_resumed = 0x4a979b <trigger_primary_db_res_op>} create_clustername_file = 44 __func__ = "main" Thread 11 (Thread 0x7efc7c1d0700 (LWP 32004)): #0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000464f20 in _wr_wrlock (datatype=datatype@entry=JOB_LOCK) at locks.c:229 err = <optimized out> __func__ = "_wr_wrlock" #2 0x000000000046516c in lock_slurmctld (lock_levels=...) at locks.c:133 No locals. 
#3 0x000000000041e1ac in _agent_retry (mail_too=false, min_wait=999) at agent.c:1381 agent_arg_ptr = 0x0 mi = 0x0 rc = <optimized out> now = 1531981561 queued_req_ptr = 0x0 retry_iter = <optimized out> job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} #4 _agent_init (arg=<optimized out>) at agent.c:1326 min_wait = 999 mail_too = false ts = {tv_sec = 1531981562, tv_nsec = 0} __func__ = "_agent_init" #5 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #6 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 10 (Thread 0x7efc721b5700 (LWP 32019)): #0 0x00007efc7b7fbe91 in sigwait () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000000000042925c in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891 sig = 1 i = <optimized out> rc = <optimized out> sig_array = {2, 15, 1, 6, 12, 0} set = {__val = {18467, 0 <repeats 15 times>}} __func__ = "_slurmctld_signal_hand" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 9 (Thread 0x7efc720b4700 (LWP 32020)): #0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000000000049ee9c in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204 err = <optimized out> last_save = 1531981553 now = 1531981553 save_delay = <optimized out> run_save = <optimized out> save_count = 0 __func__ = "slurmctld_state_save" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 8 (Thread 0x7efc722b6700 (LWP 32016)): #0 0x00007efc7b519413 in select () from /lib64/libc.so.6 No symbol table info available. 
#1 0x0000000000424f09 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026 max_fd = <optimized out> newsockfd = <optimized out> sockfd = 0x7efc5c000950 cli_addr = {sin_family = 2, sin_port = 35469, sin_addr = {s_addr = 4039643308}, sin_zero = "\000\000\000\000\000\000\000"} srv_addr = {sin_family = 2, sin_port = 41242, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"} port = 41242 ip = "0.0.0.0", '\000' <repeats 24 times> fd_next = 0 i = <optimized out> nports = 1 rfds = {__fds_bits = {8, 0 <repeats 15 times>}} conn_arg = <optimized out> config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK} sigarray = {10, 0} node_addr = <optimized out> __func__ = "_slurmctld_rpc_mgr" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 7 (Thread 0x7efc71eb2700 (LWP 32022)): #0 0x00007efc7b7f86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000423536 in _purge_files_thread (no_data=<optimized out>) at controller.c:3160 err = <optimized out> job_id = 0x0 __func__ = "_purge_files_thread" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 6 (Thread 0x7efc725b9700 (LWP 32015)): #0 0x00007efc7b7f5ef7 in pthread_join () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007efc729bfe75 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1453 No locals. #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. 
Thread 5 (Thread 0x7efc72eca700 (LWP 32011)): #0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000000000043543a in _heartbeat_thread (no_data=<optimized out>) at heartbeat.c:130 err = <optimized out> beat = 30 now = <optimized out> nl = 16730398017500217344 ts = {tv_sec = 1531981574, tv_nsec = 0} reg_file = 0x0 new_file = 0x0 fd = 7 __func__ = "_heartbeat_thread" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x7efc72cc8700 (LWP 32013)): #0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000000000043060d in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161 err = <optimized out> ts = {tv_sec = 1531981562, tv_nsec = 0} job_update_info = <optimized out> __func__ = "_fed_job_update_thread" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x7efc72dc9700 (LWP 32012)): #0 0x00007efc7b7f8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x000000000042c963 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203 err = <optimized out> cluster = <optimized out> ts = {tv_sec = 1531981562, tv_nsec = 0} cluster_iter = <optimized out> rpc_iter = <optimized out> rpc_rec = <optimized out> req_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0} resp_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0} ctld_req_msg = {my_list = 0x0} success_bits = <optimized out> rc = <optimized out> resp_inx = <optimized out> success_size = <optimized out> fed_read_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = READ_LOCK} __func__ = "_agent_thread" #2 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x7efc726ba700 (LWP 32014)): #0 0x00007efc7b4e8efd in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00007efc7b4e8d94 in sleep () from /lib64/libc.so.6 No symbol table info available. 
#2 0x00007efc729c2596 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333 start_time = 1531981347 last_reset = 1469517764 next_reset = 0 calc_period = 300 decay_hl = <optimized out> reset_period = 0 now = 1531981347 run_delta = <optimized out> real_decay = <optimized out> elapsed = <optimized out> job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK, federation = NO_LOCK} locks = {assoc = WRITE_LOCK, file = NO_LOCK, qos = NO_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK} __func__ = "_decay_thread" #3 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x7efc735ed700 (LWP 32010)): #0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007efc7b461ce8 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00007efc7b459566 in __assert_fail_base () from /lib64/libc.so.6 No symbol table info available. #3 0x00007efc7b459612 in __assert_fail () from /lib64/libc.so.6 No symbol table info available. 
#4 0x00007efc7bc72a21 in bit_ffs (b=<optimized out>) at bitstring.c:475 bit = 0 value = -1 __PRETTY_FUNCTION__ = "bit_ffs" #5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677 i = 0 j = 0 num_jobs = 8 size = <optimized out> x = 0 this_row = <optimized out> orig_row = 0x7efc4c8c60b0 ss = 0x7efc4c7511a0 __func__ = "_build_row_bitmaps" #6 0x00007efc79dfc053 in _rm_job_from_res (part_record_ptr=part_record_ptr@entry=0x7efc4c0142c0, node_usage=node_usage@entry=0x7efc4c209940, job_ptr=job_ptr@entry=0x7efc44446bd0, action=action@entry=0) at select_cons_res.c:1294 p_ptr = 0x7efc4c892110 job = 0x7efc4c643a20 node_ptr = <optimized out> first_bit = 0 last_bit = <optimized out> i = <optimized out> n = <optimized out> gres_list = <optimized out> __func__ = "_rm_job_from_res" #7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931 first_job_ptr = 0x7efc44446bd0 next_job_ptr = <optimized out> overlap = <optimized out> last_job_ptr = 0x7efc44446bd0 rm_job_cnt = 0 tv1 = {tv_sec = 1531981561, tv_usec = 105137} tv_str = '\000' <repeats 19 times> delta_t = 139621943865360 time_window = 30 more_jobs = true tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816} cr_job_list = 0x7efc44b91fb0 tmp_cr_type = 20 future_part = 0x7efc4c0142c0 tmp_job_ptr = 0x7efc44446bd0 preemptee_iterator = <optimized out> orig_map = 0x7efc4c014300 qos_preemptor = false future_usage = 0x7efc4c209940 job_iterator = 0x7efc74000990 action = <optimized out> rc = -1 now = 1531981561 #8 select_p_job_test (job_ptr=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x7efc735ecab8, exc_core_bitmap=0x0) at select_cons_res.c:2310 
rc = 22 debug_cpu_bind = false debug_check = true #9 0x00007efc7bca9a3c in select_g_job_test (job_ptr=job_ptr@entry=0x7efc4c0e5e90, bitmap=0x7efc4c0feaa0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7efc735ecab8, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:582 No locals. #10 0x00007efc735f2f39 in _try_sched (job_ptr=job_ptr@entry=0x7efc4c0e5e90, avail_bitmap=avail_bitmap@entry=0x7efc735ecdf8, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:482 orig_shared = 254 now = 1531981561 str = "\300\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\070\000\000\000\000\000\000\000(\000\000\000\000\000\000\000\340\063\326{\374~\000\000\247\000\000\000\000\000\000\000e9\326{\374~\000\000\305\065P[\000\000\000\000\301I\325{\374~\000\000\240\316^s\374~\000\000\240\000\000\000\000\000\000\000\240\316^s" low_bitmap = 0x0 tmp_bitmap = 0x7efc4c014280 rc = 0 has_xor = false feat_cnt = 0 detail_ptr = <optimized out> preemptee_candidates = 0x0 preemptee_job_list = 0x0 feat_iter = <optimized out> feat_ptr = <optimized out> __func__ = "_try_sched" #11 0x00007efc735f5677 in _attempt_backfill () at backfill.c:1894 bf_job_id = <optimized out> bf_array_task_id = <optimized out> bf_job_priority = <optimized out> tv1 = {tv_sec = 1531981560, tv_usec = 876382} tv2 = {tv_sec = 0, tv_usec = 139622873694297} tv_str = '\000' <repeats 19 times> delta_t = 139622873694297 job_queue = <optimized out> job_queue_rec = 0x0 bb = <optimized out> i = <optimized out> j = <optimized out> k = <optimized out> node_space_recs = <optimized out> mcs_select = <optimized out> qos_ptr = <optimized out> job_ptr = 0x7efc4c0e5e90 part_ptr = <optimized out> bf_part_ptr = 0x0 end_time = 1531983301 end_reserve = <optimized out> deadline_time_limit = <optimized out> boot_time = 0 
orig_end_time = <optimized out> time_limit = <optimized out> comp_time_limit = <optimized out> orig_time_limit = <optimized out> part_time_limit = <optimized out> min_nodes = 1 max_nodes = 1 req_nodes = 1 active_bitmap = 0x0 avail_bitmap = 0x7efc4c0feaa0 exc_core_bitmap = 0x0 resv_bitmap = 0x7efc4c00da50 now = 1531981561 sched_start = <optimized out> later_start = 0 start_res = 1531981561 resv_end = <optimized out> window_end = <optimized out> orig_sched_start = <optimized out> orig_start_time = <optimized out> node_space = 0x7efc4c0f26a0 bf_user_part_ptr = 0x0 bf_time1 = {tv_sec = 1531981560, tv_usec = 877289} bf_time2 = {tv_sec = 1531981530, tv_usec = 876177} rc = 0 error_code = <optimized out> job_test_count = <optimized out> test_time_count = <optimized out> pend_time = <optimized out> uid = 0x0 nuser = <optimized out> bf_parts = <optimized out> bf_part_jobs = 0x0 bf_part_resv = 0x0 njobs = 0x0 already_counted = true reject_array_job_id = <optimized out> reject_array_part = <optimized out> job_start_cnt = <optimized out> start_time = <optimized out> config_update = <optimized out> part_update = <optimized out> start_tv = {tv_sec = 1531981560, tv_usec = 876400} test_array_job_id = <optimized out> test_array_count = <optimized out> job_no_reserve = <optimized out> resv_overlap = true save_share_res = <optimized out> save_whole_node = <optimized out> test_fini = -1 user_part_inx1 = <optimized out> user_part_inx2 = <optimized out> part_inx = <optimized out> user_inx = <optimized out> qos_flags = <optimized out> qos_blocked_until = <optimized out> qos_part_blocked_until = <optimized out> qos_read_lock = {assoc = NO_LOCK, file = NO_LOCK, qos = READ_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK} __func__ = "_attempt_backfill" #12 0x00007efc735f7bc0 in backfill_agent (args=<optimized out>) at backfill.c:904 now = <optimized out> wait_time = <optimized out> last_backfill_time = 1531981530 all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = 
WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK} load_config = <optimized out> short_sleep = <optimized out> backfill_cnt = 23555 __func__ = "backfill_agent"
#13 0x00007efc7b7f4dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available.
#14 0x00007efc7b521ced in clone () from /lib64/libc.so.6 No symbol table info available.

Thanks
Damien

Hi

Could you run gdb interactively on core.32003 and enter these commands?
t 1
f 7
p tmp_job_ptr
p rm_job_cnt

Dominik

Hi Dominik

I’m not familiar with gdb. Can you give me more details?

Thanks
Damien

On Friday, 20 July 2018, <bugs@schedmd.com> wrote:
> Comment # 9 <https://bugs.schedmd.com/show_bug.cgi?id=5452#c9> on bug 5452 <https://bugs.schedmd.com/show_bug.cgi?id=5452> from Dominik Bartkiewicz <bart@schedmd.com>
>
> Hi
>
> Could you run gdb interactively on core.32003 and enter these commands?
> t 1
> f 7
> p tmp_job_ptr
> p rm_job_cnt
>
> Dominik

Hi

Of course. Start gdb with the binary and the core file:
gdb /opt/slurm-17.11.4/sbin/slurmctld core.32003
You should then see a prompt like this: "(gdb)"
Go to thread 1:
t 1
then pick frame 7:
f 7
and then you can print some values:
p tmp_job_ptr
p rm_job_cnt

Dominik

Hi

Let me know if this is clear.
Could you also send me the value of ss[x].tmpjobs?
e.g.:
thread 1
frame 5
print ss[x].tmpjobs

Dominik

Hi Dominik

Here are the values:

tmp]# gdb /opt/slurm-17.11.4/sbin/slurmctld core.32003
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm-17.11.4/sbin/slurmctld...done.
[New LWP 32010]
[New LWP 32014]
[New LWP 32012]
[New LWP 32013]
[New LWP 32011]
[New LWP 32015]
[New LWP 32022]
[New LWP 32016]
[New LWP 32020]
[New LWP 32019]
[New LWP 32004]
[New LWP 32003]
[New LWP 32005]
[New LWP 32006]
[New LWP 32008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-17.11.4/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 sssd-client-1.13.0-40.el7_2.12.x86_64
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) f 7
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931 select_cons_res.c: No such file or directory.
(gdb) p tmp_job_ptr
$1 = (struct job_record *) 0x7efc44446bd0
(gdb) p rm_job_cnt
$2 = 0
(gdb)

Hi Dominik

Here are the extras:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931
1931 in select_cons_res.c
(gdb) frame 5
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 in select_cons_res.c
(gdb) print ss[x].tmpjobs
$7 = (struct job_resources *) 0x7efc44660740
(gdb)

I hope this is sufficient; if not, please let us know.
Many Thanks
Damien

Hi

Thank you, I appreciate your efforts and patience.
I should have asked for this earlier: could you attach the output of these commands?
thread 1
frame 5
info locals
print *(ss[x].tmpjobs)
t 1
f 7
info locals
print *tmp_job_ptr

Dominik

Hi Dominik

Here you go:

(gdb)
(gdb) thread 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#0 0x00007efc7b4605f7 in raise () from /lib64/libc.so.6
(gdb) frame 5
#5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677
677 select_cons_res.c: No such file or directory.
(gdb) info locals
i = 0
j = 0
num_jobs = 8
size = <optimized out>
x = 0
this_row = <optimized out>
orig_row = 0x7efc4c8c60b0
ss = 0x7efc4c7511a0
__func__ = "_build_row_bitmaps"
(gdb) print *(ss[x].tmpjobs)
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb)
$2 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb)
$3 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 1, sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) t 1
[Switching to thread 1 (Thread 0x7efc735ed700 (LWP 32010))]
#5 0x00007efc79dfb8cc in _build_row_bitmaps
(p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677 677 in select_cons_res.c (gdb) f 5 #5 0x00007efc79dfb8cc in _build_row_bitmaps (p_ptr=p_ptr@entry=0x7efc4c892110, job_ptr=job_ptr@entry=0x7efc44446bd0) at select_cons_res.c:677 677 in select_cons_res.c (gdb) f 7 #7 0x00007efc79dfd07c in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x7efc735ecab8, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=1, min_nodes=1, bitmap=0x7efc4c0feaa0, job_ptr=0x7efc4c0e5e90) at select_cons_res.c:1931 1931 in select_cons_res.c (gdb) info locals first_job_ptr = 0x7efc44446bd0 next_job_ptr = <optimized out> overlap = <optimized out> last_job_ptr = 0x7efc44446bd0 rm_job_cnt = 0 tv1 = {tv_sec = 1531981561, tv_usec = 105137} tv_str = '\000' <repeats 19 times> delta_t = 139621943865360 time_window = 30 more_jobs = true tv2 = {tv_sec = 139622874530176, tv_usec = 139622071926816} cr_job_list = 0x7efc44b91fb0 tmp_cr_type = 20 future_part = 0x7efc4c0142c0 tmp_job_ptr = 0x7efc44446bd0 preemptee_iterator = <optimized out> orig_map = 0x7efc4c014300 qos_preemptor = false future_usage = 0x7efc4c209940 job_iterator = 0x7efc74000990 action = <optimized out> rc = -1 now = 1531981561 (gdb) print *tmp_job_ptr $4 = {account = 0x7efc4423e7c0 "ax22", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7efc446f8000 "m3-login2", alloc_resp_port = 0, alloc_sid = 667, array_job_id = 2814050, array_task_id = 1, array_recs = 0x0, assoc_id = 1337, assoc_ptr = 0x111d1c0, batch_flag = 1, batch_host = 0x7efc4c22a170 "m3a000", billable_tres = 6, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 6, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7efc442249f0, direct_set_prio = 0, end_time = 1531983301, end_time_exp = 1531983301, epilog_running = false, exit_code = 0, fed_details = 0x0, 
front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7efc4c40b0d0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7efc4c2653c0 "", gres_used = 0x0, group_id = 10025, job_id = 2814051, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7efc4c643a20, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1531981561, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7efc4414c940}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7efc446f7fd0 "seecr19july", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7efc4c2653a0 "m3a000", node_addr = 0x7efc4c119940, node_bitmap = 0x7efc4c148520, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7efc444ba010 "short", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7efc44863820, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 73105, priority_array = 0x0, prio_factors = 0x7efc446f7f40, profile = 4294967295, qos_id = 1, qos_ptr = 0x10a3920, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7efc4489c500, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1531981561, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x7efc449009c0, suspend_time = 0, time_last_active = 1531981561, time_limit = 29, time_min = 0, tot_sus_time = 0, total_cpus = 6, total_nodes = 1, tres_req_cnt = 0x7efc44c1d9b0, tres_req_str = 0x7efc445ecca0 "1=6,2=8000,4=1", tres_fmt_req_str = 0x7efc4428bd50 "cpu=6,mem=8000M,node=1", tres_alloc_cnt = 0x7efc4c645f00, tres_alloc_str = 0x7efc4c0bd590 "1=6,2=8000,3=18446744073709551614,4=1,5=6", tres_fmt_alloc_str = 0x7efc4c40b040 "cpu=6,mem=8000M,node=1,billing=6", user_id = 11014, user_name = 0x0, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)

I hope that you find the problem.

Many Thanks.
Damien

Hi

We are still investigating this issue. Does this still occur?

Dominik

Hi Dominik

It has not reappeared since, but it crashed twice last Thursday night and once about a month ago.
We are looking into whether there is a preventative measure that we can use, and whether this is a CPU-load issue or a configuration problem.

Cheers
Damien

Hi

This patch should fix this issue. It hasn't been committed yet, but we expect it to be soon, in this or a similar form.

Dominik

Hi

This is fixed in commit:
https://github.com/SchedMD/slurm/commit/fef07a409724
I'm going to go ahead and mark this as Resolved/Fixed. Please feel free to re-open this if there's anything else we can help with.

Dominik

*** Ticket 5447 has been marked as a duplicate of this ticket. ***
*** Ticket 5438 has been marked as a duplicate of this ticket. ***
*** Ticket 5675 has been marked as a duplicate of this ticket. ***