Ticket 6270 - after core network outage slurmctld will not start - fails with segfault
Summary: after core network outage slurmctld will not start - fails with segfault
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Jason Booth
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-12-19 10:28 MST by Jenny Williams
Modified: 2018-12-26 09:32 MST

See Also:
Site: University of North Carolina at Chapel Hill
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: RHEL
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
patch_to_avoid_segfault (448 bytes, patch)
2018-12-19 11:06 MST, Jason Booth

Description Jenny Williams 2018-12-19 10:28:22 MST
This is the last of the slurmctld -D -vvv output from the attempted start:

slurmctld: Recovered JobID=3915222 State=0x1 NodeCnt=0 Assoc=1376
slurmctld: recovered job step 3915224.4294967295
slurmctld: Recovered JobID=3566919_79(3915224) State=0x1 NodeCnt=0 Assoc=2818
slurmctld: Recovered JobID=3566919_* State=0x0 NodeCnt=0 Assoc=2818
slurmctld: recovered job step 3915225.4294967295
slurmctld: Recovered JobID=3915225 State=0x1 NodeCnt=0 Assoc=1376
slurmctld: recovered job step 3915226.4294967295
slurmctld: Recovered JobID=3915226 State=0x1 NodeCnt=0 Assoc=1376
slurmctld: recovered job step 3915227.4294967295
slurmctld: Recovered JobID=3915227 State=0x1 NodeCnt=0 Assoc=1376
slurmctld: Recovered JobID=3915228 State=0x0 NodeCnt=0 Assoc=476
slurmctld: Recovered JobID=3915229 State=0x0 NodeCnt=0 Assoc=948
slurmctld: Recovered information about 42675 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 12 partitions
slurmctld: debug2: init_requeue_policy: kill_invalid_depend is set to 1
slurmctld: _sync_nodes_to_comp_job: Job 3845366 in completing state
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, qos bigmem_access grp_used_tres_run_secs(cpu) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, qos bigmem_access grp_used_tres_run_secs(mem) is 305233920000
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, qos bigmem_access grp_used_tres_run_secs(node) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, qos bigmem_access grp_used_tres_run_secs(billing) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, qos bigmem_access grp_used_tres_run_secs(gres/gpu) is 0
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1009(rc_dklotsa_pi/minzhi/(null)) grp_used_tres_run_secs(cpu) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1009(rc_dklotsa_pi/minzhi/(null)) grp_used_tres_run_secs(mem) is 305233920000
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1009(rc_dklotsa_pi/minzhi/(null)) grp_used_tres_run_secs(node) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1009(rc_dklotsa_pi/minzhi/(null)) grp_used_tres_run_secs(billing) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1009(rc_dklotsa_pi/minzhi/(null)) grp_used_tres_run_secs(gres/gpu) is 0
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 385(rc_dklotsa_pi/(null)/(null)) grp_used_tres_run_secs(cpu) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 385(rc_dklotsa_pi/(null)/(null)) grp_used_tres_run_secs(mem) is 305233920000
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 385(rc_dklotsa_pi/(null)/(null)) grp_used_tres_run_secs(node) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 385(rc_dklotsa_pi/(null)/(null)) grp_used_tres_run_secs(billing) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 385(rc_dklotsa_pi/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 0
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 305233920000
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 129600
slurmctld: debug2: acct_policy_job_begin: after adding job 3845366, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 0
slurmctld: debug2: We have already ran the job_fini for job 3845366
slurmctld: cleanup_completing: job 3845366 completion process took 3921 seconds
Segmentation fault
[root@longleaf-sched ~]# systemctl stop slurmctld
[root@longleaf-sched ~]# systemctl status slurmctld
* slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: core-dump) since Wed 2018-12-19 11:49:09 EST; 12min ago
  Process: 80473 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 80475 (code=dumped, signal=SEGV)

Dec 19 11:49:07 longleaf-sched.its.unc.edu systemd[1]: Starting Slurm controller daemon...
Dec 19 11:49:07 longleaf-sched.its.unc.edu systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
Dec 19 11:49:07 longleaf-sched.its.unc.edu systemd[1]: Started Slurm controller daemon.
Dec 19 11:49:09 longleaf-sched.its.unc.edu systemd[1]: slurmctld.service: main process exited, code=dumped, status=11/SEGV
Dec 19 11:49:09 longleaf-sched.its.unc.edu systemd[1]: Unit slurmctld.service entered failed state.
Dec 19 11:49:09 longleaf-sched.its.unc.edu systemd[1]: slurmctld.service failed.
Comment 1 Jenny Williams 2018-12-19 10:34:10 MST
My contact number is 919-923-3987
Comment 2 Jason Booth 2018-12-19 10:42:45 MST
Hi Jenny,

 Would you please attach the output of "thread apply all bt full" from the core file.

You can do this by locating the core file and running:

gdb /path/to/slurmctld corefile
(gdb) thread apply all bt full

-Jason
Comment 3 Jenny Williams 2018-12-19 10:50:15 MST
[root@longleaf-sched slurmctld]# which slurmctld
/sbin/slurmctld
[root@longleaf-sched slurmctld]# gdb /sbin/slurmctld core.110347
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 110347]
[New LWP 110348]
[New LWP 110351]
[New LWP 110349]
[New LWP 110350]
[New LWP 110354]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x482e980) at step_mgr.c:2081
2081            i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1.el7.x86_64
(gdb) thread apply all bt full

Thread 6 (Thread 0x7f2ea3688700 (LWP 110354)):
#0  0x00007f2ea8120d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f2ea368cbe6 in _my_sleep (usec=30000000) at backfill.c:540
        err = <optimized out>
        nsec = <optimized out>
        sleep_time = 0
        ts = {tv_sec = 1545240145, tv_nsec = 227789000}
        tv1 = {tv_sec = 1545240115, tv_usec = 227789}
        tv2 = {tv_sec = 0, tv_usec = 0}
        __func__ = "_my_sleep"
#2  0x00007f2ea3693002 in backfill_agent (args=<optimized out>) at backfill.c:876
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1545240115
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        load_config = <optimized out>
        short_sleep = false
        backfill_cnt = 0
        __func__ = "backfill_agent"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x7f2ea48a5700 (LWP 110350)):
#0  0x00007f2ea811df47 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f2ea49aa010 in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
No locals.
#2  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x7f2ea49a6700 (LWP 110349)):
#0  0x00007f2ea7e0ce2d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f2ea7e0ccc4 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f2ea49aa8c8 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7f2ea45a0700 (LWP 110351)):
#0  0x00007f2ea812369d in write () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f2ea8600037 in slurm_persist_send_msg (persist_conn=0x232fb30, buffer=buffer@entry=0xa3e0f90) at slurm_persist_conn.c:847
        msg_size = 221
        nw_size = 3707764736
        msg = <optimized out>
        msg_wrote = <optimized out>
        rc = <optimized out>
        retry_cnt = 0
#2  0x00007f2ea8677d6c in _agent (x=<optimized out>) at slurmdbd_defs.c:2034
        cnt = <optimized out>
        rc = <optimized out>
        buffer = 0xa3e0f90
        abs_time = {tv_sec = 1545240126, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {msg_type = 1474, data = 0x7f2ea459feb0}
        list_msg = {my_list = 0x0, return_code = 0}
        __func__ = "_agent"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7f2ea8aea700 (LWP 110348)):
#0  0x00007f2ea8120d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000041e3c7 in _agent_init (arg=<optimized out>) at agent.c:1313
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1545240117, tv_nsec = 0}
        __func__ = "_agent_init"
#2  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f2ea8aeb740 (LWP 110347)):
#0  _step_dealloc_lps (step_ptr=0x482e980) at step_mgr.c:2081
        cpus_alloc = <optimized out>
        i_last = <optimized out>
        i_node = <optimized out>
        i_first = <optimized out>
        step_node_inx = -1
        job_ptr = 0x482df40
        job_resrcs_ptr = 0x0
        job_node_inx = -1
        cray_simulate_logged = false
#1  post_job_step (step_ptr=step_ptr@entry=0x482e980) at step_mgr.c:4652
        job_ptr = 0x482df40
        error_code = <optimized out>
#2  0x00000000004ab053 in _post_job_step (step_ptr=0x482e980) at step_mgr.c:266
        select_cray_plugin = 0
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x482df40, step_ptr=step_ptr@entry=0x482e980) at step_mgr.c:307
        jobacct = <optimized out>
#4  0x00000000004ab0d1 in delete_step_records (job_ptr=job_ptr@entry=0x482df40) at step_mgr.c:336
        cleaning = 0
        step_iterator = 0x232d420
        step_ptr = 0x482e980
#5  0x0000000000463dd5 in cleanup_completing (job_ptr=job_ptr@entry=0x482df40) at job_scheduler.c:4700
        delay = <optimized out>
        __func__ = "cleanup_completing"
#6  0x000000000046ff7b in deallocate_nodes (job_ptr=job_ptr@entry=0x482df40, timeout=timeout@entry=false, suspended=suspended@entry=false, preempted=preempted@entry=false) at node_scheduler.c:610
        select_serial = 0
        i = <optimized out>
        kill_job = 0xa9691e0
        agent_args = 0xa969500
        down_node_cnt = <optimized out>
        node_ptr = 0x2964d48
        __func__ = "deallocate_nodes"
#7  0x000000000049435d in _sync_nodes_to_comp_job () at read_config.c:2404
        job_ptr = 0x482df40
        job_iterator = 0x232d400
        update_cnt = 1
#8  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1241
        tv1 = {tv_sec = 1545240115, tv_usec = 212445}
        tv2 = {tv_sec = 36751520, tv_usec = 20}
        tv_str = '\000' <repeats 19 times>
        delta_t = 72340172821299456
        error_code = 0
        i = <optimized out>
        rc = <optimized out>
        load_job_ret = 0
        old_node_record_count = <optimized out>
        old_node_table_ptr = <optimized out>
        node_ptr = <optimized out>
        do_reorder_nodes = <optimized out>
        old_part_list = <optimized out>
        old_def_part_name = <optimized out>
        old_auth_type = <optimized out>
        old_bb_type = <optimized out>
        old_checkpoint_type = <optimized out>
        old_crypto_type = <optimized out>
        old_preempt_mode = <optimized out>
        old_preempt_type = 0x29010c0 "preempt/none"
        old_sched_type = <optimized out>
        old_select_type = <optimized out>
        old_switch_type = <optimized out>
        state_save_dir = 0x0
        mpi_params = 0x0
        old_select_type_p = <optimized out>
        __func__ = "read_slurm_conf"
#9  0x00000000004279b6 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:497
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        stat_buf = {st_dev = 64768, st_ino = 68778753, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392880, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1545175099, tv_nsec = 500946750}, st_mtim = {
            tv_sec = 1518025023, tv_nsec = 0}, st_ctim = {tv_sec = 1530169016, tv_nsec = 524506863}, __unused = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        callbacks = {acct_full = 0x4ac9f5 <trigger_primary_ctld_acct_full>, dbd_fail = 0x4acc04 <trigger_primary_dbd_fail>, dbd_resumed = 0x4acc92 <trigger_primary_dbd_res_op>, db_fail = 0x4acd17 <trigger_primary_db_fail>, 
          db_resumed = 0x4acda5 <trigger_primary_db_res_op>}
        create_clustername_file = false
        __func__ = "main"
(gdb)
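
The backtrace above isolates the fault: in thread 1, frame #0, job_resrcs_ptr is 0x0 while step_mgr.c:2081 evaluates bit_ffs(job_resrcs_ptr->node_bitmap), so slurmctld dereferences a NULL pointer on a job record recovered from the damaged state files. The following standalone C program models that pattern; the types and bit_ffs() here are simplified stand-ins for illustration, not Slurm's real definitions.

#include <stdio.h>

/* Simplified stand-ins for Slurm's types; only the pointer
 * relationships seen in the backtrace are modeled here. */
typedef struct bitstr { unsigned long bits; } bitstr_t;
typedef struct job_resources { bitstr_t *node_bitmap; } job_resources_t;

/* Stand-in for Slurm's bit_ffs(): index of the first set bit.
 * Like the real function, it assumes its argument is non-NULL. */
static int bit_ffs(bitstr_t *b)
{
    for (int i = 0; i < 64; i++)
        if (b->bits & (1UL << i))
            return i;
    return -1;
}

int main(void)
{
    /* Per the backtrace, the recovered job record carried
     * job_resrcs_ptr = 0x0. */
    job_resources_t *job_resrcs_ptr = NULL;

    /* Equivalent of step_mgr.c:2081: reading node_bitmap through the
     * NULL pointer raises SIGSEGV, matching the observed crash. */
    int i_first = bit_ffs(job_resrcs_ptr->node_bitmap);

    printf("i_first = %d\n", i_first); /* never reached */
    return 0;
}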
Comment 4 Jenny Williams 2018-12-19 10:51:27 MST
This installation keeps the slurmctld state files on a shared network drive; when the network went out, my guess is the state files were corrupted.
Comment 5 Jenny Williams 2018-12-19 10:52:50 MST
(In reply to Jenny Williams from comment #1)
> My contact number is 919-923-3987

Updating this to my desk - 919-962-4751
Comment 7 Jason Booth 2018-12-19 11:06:22 MST
Created attachment 8709 [details]
patch_to_avoid_segfault

I have attached a patch to help you move past the segfault with slurm-17.11.7-1. Would you please apply it and rebuild Slurm on the head node:

cd /your/source
patch -p1 < patch_to_avoid_segfault.patch

Then re-make.

-Jason
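
For the record, the attachment body is not reproduced inline in this ticket. Based on the backtrace in comment 3, a guard of roughly the following shape in _step_dealloc_lps() would let slurmctld log the bad job and continue instead of dereferencing the NULL job_resrcs pointer. This is only a sketch of that idea under those assumptions, not the actual attached patch.

/* Sketch only -- not the attached patch. Identifiers follow the
 * backtrace in comment 3; the rest of the function body is elided. */
static void _step_dealloc_lps(struct step_record *step_ptr)
{
    struct job_record *job_ptr = step_ptr->job_ptr;
    job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;
    int i_first;

    if (!job_resrcs_ptr) {
        /* State recovered after the outage left this job without a
         * job_resrcs record; skip deallocation rather than crash. */
        error("%s: JobId=%u has no job_resrcs record; "
              "skipping step deallocation", __func__, job_ptr->job_id);
        return;
    }

    i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
    /* ... existing deallocation logic continues here ... */
}

Whatever form the real patch takes, the check after rebuilding is simply that slurmctld gets through read_slurm_conf() and logs the affected job instead of dumping core.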
Comment 8 Jenny Williams 2018-12-19 11:08:21 MST
Will work on that. I assume this will recover pending jobs but running jobs will have been terminated by now due to slurmd timeout – is that correct?
J

Comment 10 Jason Booth 2018-12-19 11:22:16 MST
Jenny,

> Will work on that. I assume this will recover pending jobs but running jobs will have been terminated by now due to slurmd timeout – is that correct?


The issue only affects slurmctld and should not trigger the slurmd timeout code you mention. Your jobs on the slurmds should not be affected unless the slurmctld has lost all record of them; in that case, they will be canceled.

-Jason
Comment 11 Jenny Williams 2018-12-19 11:42:24 MST
Close ... I started the build again after changing Python versions, but a hint as to which Python version it wants would be appreciated.

make[4]: Nothing to be done for `all-am'.
make[4]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite/slurm_unit'
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite/slurm_unit'
make[3]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
make[2]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
Making all in doc
make[2]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc'
Making all in man
make[3]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man'
Making all in man1
make[4]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man/man1'
`dirname sacct.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt sacct.1
`dirname sacctmgr.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt sacctmgr.1
`dirname salloc.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt salloc.1
Fatal Python error: Py_Initialize: Unable to get the locale encoding
  File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
    raise CodecRegistryError,\
                            ^
SyntaxError: invalid syntax

Current thread 0x00007f238323c740 (most recent call first):
Fatal Python error: Py_Initialize: Unable to get the locale encoding
  File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
    raise CodecRegistryError,\
                            ^
SyntaxError: invalid syntax

Current thread 0x00007f9853a00740 (most recent call first):
Fatal Python error: Py_Initialize: Unable to get the locale encoding
  File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
    raise CodecRegistryError,\
                            ^
SyntaxError: invalid syntax

Current thread 0x00007f6163e79740 (most recent call first):
make[4]: *** [sacct.html] Aborted
make[4]: *** Waiting for unfinished jobs....
make[4]: *** [salloc.html] Aborted
make[4]: *** [sacctmgr.html] Aborted
make[4]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man/man1'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.x5gG0X (%build)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.x5gG0X (%build)

^C
[jennyw@longleaf-sched rpmbuild]$ which python
/nas/longleaf/apps/python/2.7.12/bin/python
[jennyw@longleaf-sched rpmbuild]$ module list

Currently Loaded Modules:
  1) python/2.7.12
Comment 12 Jenny Williams 2018-12-19 11:56:02 MST
It is up.

Wow. Thank you.

Any follow-on to this?
Comment 13 Jason Booth 2018-12-19 12:03:11 MST
Jenny,

 What version of RHEL is this, EL6? Also, what did you change to resolve your Python issue (PYTHONPATH)?
  
-Jason
Comment 14 Jenny Williams 2018-12-19 12:08:02 MST
RHEL 7.6


I moved to a "build" node; I had been running the compile on a node that may not have had all the prerequisites. The Python version that ended up working for me was 3.6.3.
Comment 15 Jenny Williams 2018-12-19 12:09:16 MST
PYTHONPATH was empty - no value for PYTHONPATH
Comment 16 Jason Booth 2018-12-20 15:32:47 MST
Jenny,

 Just a quick update on the original crash: I could investigate it further if you attach the slurmctld logs from the day of the crash.

-Jason
Comment 17 Jason Booth 2018-12-26 09:32:59 MST
Hi Jenny,

 I am marking this issue resolved. If you need any further assistance with this, please feel free to re-open this case.