| Summary: | after removal of reservation slurmctld is crashing at start with a coredump | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Hebenstreit <michael.hebenstreit> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart, cinek, nate, richard.dahringer |
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9720 | ||
| Site: | Intel CRT | Alineos Sites: | --- |
| Version Fixed: | 21.08pre1 | Target Release: | --- |
| Attachments: | output from starting slurmctld as root with option -v -v -v -D | ||
| | workaround (v1) | ||
Could you please load the core into gdb with your slurmctld binary and share the backtrace from it:

    gdb /path/to/slurmctld /path/to/core.XX
    (gdb) t a a bt f

cheers,
Marcin

[root@eslurm1 ~]# gdb /opt/slurm/20.02.3/sbin/slurmctld /var/lib/systemd/coredump/core.slurmctld.509.00fce0901101407abc89e88bd95377c3.12557.1603808600000000
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/slurm/20.02.3/sbin/slurmctld...done.
[New LWP 12557]
[New LWP 12561]
[New LWP 12563]
[New LWP 12558]
[New LWP 12559]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
(warning repeated 12 times)
Core was generated by `/opt/slurm/20.02.3/sbin/slurmctld -D -vvv'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000014a142e80e59 in bit_clear (b=0x0, bit=bit@entry=277) at bitstring.c:258
258             b[_bit_word(bit)] &= ~_bit_mask(bit);
[Current thread is 1 (Thread 0x14a14340e100 (LWP 12557))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-101.el8.x86_64 libblkid-2.32.1-22.el8.x86_64 libcap-2.26-3.el8.x86_64 libgcc-8.3.1-5.el8.0.2.x86_64 libmount-2.32.1-22.el8.x86_64 libselinux-2.9-3.el8.x86_64 libuuid-2.32.1-22.el8.x86_64 munge-libs-0.5.13-1.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 systemd-libs-239-31.el8_2.2.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) t a a bt f?
Thread 5 (Thread 0x14a13eafb700 (LWP 12559)):
No symbol "f" in current context.
(gdb) t a a bt f
Thread 5 (Thread 0x14a13eafb700 (LWP 12559)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014a13eb01073 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:451
        err = <optimized out>
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        tvnow = {tv_sec = 1603808597, tv_usec = 749411}
        abs = {tv_sec = 1603808602, tv_nsec = 749411000}
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        job_write_lock = {conf = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#2  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x14a14340c700 (LWP 12558)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000421fef in _agent_init (arg=<optimized out>) at agent.c:1395
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1603808601, tv_nsec = 0}
        last_defer_attempt = 0
        __func__ = "_agent_init"
#2  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x14a13d37c700 (LWP 12563)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014a13d381110 in _my_sleep (usec=30000000) at backfill.c:668
        err = <optimized out>
        nsec = <optimized out>
        sleep_time = 0
        ts = {tv_sec = 1603808629, tv_nsec = 525767000}
        tv1 = {tv_sec = 1603808599, tv_usec = 525767}
        tv2 = {tv_sec = 0, tv_usec = 0}
        __func__ = "_my_sleep"
#2  0x000014a13d388644 in backfill_agent (args=<optimized out>) at backfill.c:1028
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1603808599
        all_locks = {conf = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = READ_LOCK, fed = READ_LOCK}
        load_config = <optimized out>
        short_sleep = false
        backfill_cnt = 0
        __func__ = "backfill_agent"
#3  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x14a13e9fa700 (LWP 12561)):
#0  0x000014a1424fff21 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x000014a142f04885 in _conn_readable (persist_conn=persist_conn@entry=0x7c5240) at slurm_persist_conn.c:138
        ufds = {fd = 6, events = 1, revents = 0}
        rc = <optimized out>
        time_left = 900000
        __func__ = "_conn_readable"
#2  0x000014a142f05e77 in slurm_persist_recv_msg (persist_conn=0x7c5240) at slurm_persist_conn.c:925
        msg_size = 226
        nw_size = 3791650816
        msg = 0x14a134004c70 ""
        msg_read = <optimized out>
        offset = 0
        buffer = <optimized out>
        __func__ = "slurm_persist_recv_msg"
#3  0x000014a13eb072c4 in _handle_mult_rc_ret () at slurmdbd_agent.c:189
        buffer = <optimized out>
        msg_type = 0
        msg = 0x0
        list_msg = 0x0
        rc = -1
        out_buf = 0x0
        buffer = <optimized out>
        msg_type = <optimized out>
        msg = <optimized out>
        list_msg = <optimized out>
        rc = <optimized out>
        out_buf = <optimized out>
        __func__ = "_handle_mult_rc_ret"
        err = <optimized out>
        itr = <optimized out>
        b = <optimized out>
        err = <optimized out>
#4  _agent (x=<optimized out>) at slurmdbd_agent.c:852
        rc = 0
        cnt = 12
        buffer = 0xcc4ba0
        abs_time = {tv_sec = 1603808607, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {conn = 0x0, data = 0x14a13e9f9e70, data_size = 0, msg_type = 1474}
        list_msg = {my_list = 0x7b5bb0, return_code = 0}
        tv1 = {tv_sec = 1603808600, tv_usec = 250167}
        tv2 = {tv_sec = 1603808600, tv_usec = 250165}
        tv_str = "usec=44067\000\000\000\000\000\000\000\000\000"
        delta_t = 44067
        __func__ = "_agent"
#5  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x14a14340e100 (LWP 12557)):
#0  0x000014a142e80e59 in bit_clear (b=0x0, bit=bit@entry=277) at bitstring.c:258
No locals.
#1  0x0000000000479e41 in make_node_idle (node_ptr=node_ptr@entry=0x14a1432a3c38, job_ptr=job_ptr@entry=0xca7950) at node_mgr.c:3859
        inx = 277
        node_flags = <optimized out>
        now = 1603808600
        node_bitmap = <optimized out>
        __func__ = "make_node_idle"
#2  0x0000000000446c4d in excise_node_from_job (job_ptr=job_ptr@entry=0xca7950, node_ptr=node_ptr@entry=0x14a1432a3c38) at job_mgr.c:4146
        i = <optimized out>
        i_first = <optimized out>
        i_last = <optimized out>
        orig_pos = -1
        new_pos = -1
        orig_bitmap = 0xd6f580
        __func__ = "excise_node_from_job"
#3  0x00000000004a2e4a in _sync_nodes_to_active_job (job_ptr=0xca7950) at read_config.c:2669
        save_accounting_enforce = 11
        i = 277
        cnt = 0
        node_flags = <optimized out>
        node_ptr = 0x14a1432a3c38
        i = <optimized out>
        cnt = <optimized out>
        node_flags = <optimized out>
        node_ptr = <optimized out>
        __func__ = "_sync_nodes_to_active_job"
        save_accounting_enforce = <optimized out>
#4  _sync_nodes_to_jobs (reconfig=false) at read_config.c:2548
        job_ptr = 0xca7950
        job_iterator = 0xd22c40
        update_cnt = 0
        job_ptr = <optimized out>
        job_iterator = <optimized out>
        update_cnt = <optimized out>
#5  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1375
        tv1 = {tv_sec = 1603808599, tv_usec = 486287}
        tv2 = {tv_sec = 0, tv_usec = 0}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        error_code = 0
        i = <optimized out>
        rc = 0
        load_job_ret = 0
        old_node_record_count = <optimized out>
        old_node_table_ptr = <optimized out>
        node_ptr = <optimized out>
        do_reorder_nodes = <optimized out>
        old_part_list = <optimized out>
        old_def_part_name = <optimized out>
        old_auth_type = 0xaed980 "auth/munge"
        old_bb_type = 0x0
        old_cred_type = 0xaed950 "cred/munge"
        old_preempt_mode = <optimized out>
        old_preempt_type = 0xaed8c0 "preempt/none"
        old_sched_type = 0xaed5c0 "sched/backfill"
        old_select_type = 0xac5160 "select/cons_res"
        old_switch_type = 0xac5190 "switch/none"
        state_save_dir = 0xaed9b0 "/opt/slurm/current/var/spool"
        mpi_params = 0x0
        old_select_type_p = <optimized out>
        cgroup_mem_confinement = <optimized out>
        __func__ = "read_slurm_conf"
#6  0x000000000042e578 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:680
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = {tv_sec = 1603808597, tv_usec = 743865}
        now = {tv_sec = 1603808597, tv_usec = 743891}
        stat_buf = {st_dev = 2054, st_ino = 1521783, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 34280, st_blksize = 4096, st_blocks = 72, st_atim = {tv_sec = 1591804356, tv_nsec = 0}, st_mtim = {tv_sec = 1591804356, tv_nsec = 0}, st_ctim = {tv_sec = 1600794204, tv_nsec = 406611702}, __glibc_reserved = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        callbacks = {acct_full = 0x4bd08e <trigger_primary_ctld_acct_full>, dbd_fail = 0x4bd275 <trigger_primary_dbd_fail>, dbd_resumed = 0x4bd2f9 <trigger_primary_dbd_res_op>, db_fail = 0x4bd374 <trigger_primary_db_fail>, db_resumed = 0x4bd3f8 <trigger_primary_db_res_op>}
        prep_callbacks = {prolog_slurmctld = 0x48d36f <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x48d625 <prep_epilog_slurmctld_callback>}
        create_clustername_file = <optimized out>
        conf_file = <optimized out>
        __func__ = "main"

Created attachment 16372 [details]
workaround (v1)
Michael,
I'm looking into the details. Could you please apply the attached patch as a temporary workaround that should let you start slurmctld.
cheers,
Marcin
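For reference, both traces agree that the crash is a NULL-pointer dereference: `bit_clear (b=0x0, bit=277)` means the job's node bitmap was never allocated when `make_node_idle()` tried to clear a node's bit. A guard of roughly this shape would let the controller skip the bad bitmap instead of segfaulting; this is only an illustrative sketch of the crash pattern, not the actual attached patch (`bitstr_t`, `bit_clear_safe` are hypothetical stand-ins for the Slurm internals):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t bitstr_t;

/* Illustrative mirror of bitstring.c:258, the crashing line:
 *   b[_bit_word(bit)] &= ~_bit_mask(bit);
 * With b == NULL this dereference is the SIGSEGV from the core dump. */
static void bit_clear(bitstr_t *b, int bit)
{
    b[bit / 64] &= ~((bitstr_t)1 << (bit % 64));
}

/* Guarded variant: refuse to touch an unallocated bitmap and report it,
 * so the caller can log an error and continue instead of crashing. */
static int bit_clear_safe(bitstr_t *b, int bit)
{
    if (b == NULL || bit < 0)
        return -1;
    bit_clear(b, bit);
    return 0;
}
```

With such a guard, starting slurmctld against the stale job state would degrade to an error log entry rather than a core dump.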
Could you please elaborate a little bit more on the subject? What are the steps that resulted in your case?
Did the workaround patch work for you?
cheers,
Marcin

Just finished applying the patch.
Cluster working again but some strange results in sinfo are still present:
[root@eslurm1 ~]# squeue -p idealq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2829 idealq gamess_d aknyaze1 PD 0:00 32 (Resources)
2830 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2831 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2833 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2834 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2835 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2836 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2837 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2838 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2839 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2840 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2841 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2842 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2843 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2882 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2883 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2884 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2885 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2886 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2887 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2888 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2889 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2890 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2891 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2892 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2893 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2894 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2895 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2896 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2897 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2960 idealq hycom_lr msazhin PD 0:00 29 (ReqNodeNotAvail, Reserved for maintenance)
2334 idealq gamess_d aknyaze1 PD 0:00 43 (Priority)
2774 idealq gamess_d aknyaze1 R 5:57 57 eia[076-080,092-097,106-126,154-159,161-179]
2832 idealq gamess_d aknyaze1 R 5:57 16 eia[072,086-090,100-104,145-149]
[root@eslurm1 ~]# scontrol show partition=idealq
PartitionName=idealq
AllowGroups=idealqug AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=60-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=ekln[01-02],eca[001-036,038-040],eia[072-073,075-126,145-152,154-179]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=20208 TotalNodes=129 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
The jobs are not released because of “maintenance”, when in reality the nodes are simply not free.
In the morning I received information jobs were not starting. Reason was soon found – I had left behind a weekly reservation. I deleted the reservation with “scontrol delete reservation=test”, but no change. I decided to kill slurmctld and restart it. That’s when the issues started. Here is the entry of test from mysql database:
MariaDB [slurm_acct_db]> select * from endeavour_resv_table where resv_name='test';
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
| id_resv | deleted | assoclist | flags | nodelist | node_inx | resv_name | time_start | time_end | tres | unused_wall |
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
| 8 | 0 | 1 | 557137 | eca[001-036,038-088],ecnv[01-10],ecx[001-009],ega[001-080],ehx[01-02,04],eia[181-275,278-301,303-304,306-370,501-511,513-536,538-540],ekln[01-02],elb[01-04],elc[02-04],elh[01-04],els[02-05],enb[101-108,201-216,301-316,401-408,501-524],epb[001-252,254-360,501-519,555-584,587-590,601-618,701-718,801-818],epx[03-13,15-17],est[01-02] | 0-983 | test | 1603206000 | 1603206300 | 1=112392 | 300 |
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
1 row in set (0.000 sec)
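The `node_inx` column above (`0-983`) is the reservation's node-index footprint at the time it was saved. If the node table changes between the save and a controller restart, indices replayed from the database can point at missing or different nodes. A hedged sketch of the kind of bounds check such restore code needs (the function name `resv_range_valid` is hypothetical, not a Slurm API):

```c
#include <stdbool.h>

/* A persisted node-index range "first-last" is only safe to replay
 * against the current node table if every index still exists. */
static bool resv_range_valid(int inx_first, int inx_last,
                             int node_record_count)
{
    return inx_first >= 0 &&
           inx_first <= inx_last &&
           inx_last < node_record_count;
}
```

For the row above, a cluster with at least 984 node records passes the check; any shrinkage of the node table would make replaying the saved range unsafe.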
> Just finished applying the patch
> Cluster working again but some strange results in sinfo are still present:

That's positive. Are you OK with lowering the ticket severity to 3 now?

> The jobs are not released because of “maintenance”, when in reality the nodes are simply not free.

This is a known misleading value of the state reason (Bug 9720), which should not impact job scheduling.

> In the morning I received information jobs were not starting. Reason was soon found – I had left behind a weekly reservation. I deleted the reservation with “scontrol delete reservation=test”, but no change. I decided to kill slurmctld and restart it. That’s when the issues started.

Were there any changes to the partition configuration in the meantime?
cheers,
Marcin

No changes to partitions AFAIK.

Any idea when the state update bug will be fixed?
Thanks
Michael

Are you OK with lowering the ticket severity to 3 now?

> Any idea when the state update bug will be fixed?

That should rather go to the mentioned bug. In short: there is a patch there being actively reviewed by our QA team, and a workaround mentioned in bug 9720 comment 7.
cheers,
Marcin

Michael - I am lowering the priority since the cluster is up and running again, and we have a workaround for the other issue you brought up. Let us know if you have questions.

So you think I can run with the patch in place?

(In reply to Michael Hebenstreit from comment #12)
> So you think I can run with the patch in place?

Yes, the patch should resolve the crashing, but we will likely have a different long-term solution. I'm going to lower this to SEV4 as it is now a research bug. Please respond if you have any more issues and we can increase the severity.
Thanks,
--Nate

Michael,
The workaround you applied is safe and you don't have to remove it. However, the final patch that landed in 21.08 is a little bit different, commits: 8410df4e0deb0670..19e8311e74ae2cb. I'm closing the case as fixed now.
best regards,
Marcin
Created attachment 16371 [details]
output from starting slurmctld as root with option -v -v -v -D

Oct 27 07:54:08 eslurm1 systemd-coredump[11882]: Process 11874 (slurmctld) of user 509 dumped core.

Stack trace of thread 11874:
#0  0x000014a0fc542e80 bit_clear (libslurmfull.so)
#1  0x000000000047979c make_node_idle (slurmctld)
#2  0x0000000000446c43 excise_node_from_job (slurmctld)
#3  0x00000000004a2816 _sync_nodes_to_active_job (slurmctld)
#4  0x000000000042e554 main (slurmctld)
#5  0x000014a0fbaf46a3 __libc_start_main (libc.so.6)
#6  0x0000000000418c1e _start (slurmctld)

Stack trace of thread 11879:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000014a0f2987110 _my_sleep (sched_backfill.so)
#2  0x000014a0f298e3a8 backfill_agent (sched_backfill.so)
#3  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#4  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11878:
#0  0x000014a0fbbc1f21 __poll (libc.so.6)
#1  0x000014a0fc5c6b77 _conn_readable (libslurmfull.so)
#2  0x000014a0fc5c80cd slurm_persist_recv_msg (libslurmfull.so)
#3  0x000014a0f81ca29e _handle_mult_rc_ret (accounting_storage_slurmdbd.so)
#4  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#5  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11875:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000421fcb _agent_init (slurmctld)
#2  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#3  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11876:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000014a0f81c4073 _set_db_inx_thread (accounting_storage_slurmdbd.so)
#2  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#3  0x000014a0fbbcce83 __clone (libc.so.6)