On Slurm 23.02.1, slurmctld segfaults when loading the example job_submit.lua script on SLES 15 SP4. We installed the slurm-example-configs RPM, copied /etc/slurm/job_submit.lua.example to /etc/slurm/job_submit.lua, and set JobSubmitPlugins=lua in slurm.conf. Then we get this failure from slurmctld:

nid000008:~ # systemctl status slurmctld
× slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
     Active: failed (Result: signal) since Mon 2023-05-08 20:06:04 UTC; 9min ago
    Process: 129884 ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=killed, signal=SEGV)
   Main PID: 129884 (code=killed, signal=SEGV)

May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: select/linear: init: Linear node selection plugin >
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: preempt/none: init: preempt/none loaded
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_energy/none: init: AcctGatherE>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_profile/none: init: AcctGather>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_interconnect/none: init: AcctG>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_filesystem/none: init: AcctGat>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: jobacct_gather/none: init: Job accounting >
May 08 20:06:04 nid000008 slurmctld[129885]: slurmctld: debug: _slurmscriptd_mainloop: finished
May 08 20:06:04 nid000008 systemd[1]: slurmctld.service: Main process exited, code=killed, status=11/SEGV
May 08 20:06:04 nid000008 systemd[1]: slurmctld.service: Failed with result 'signal'.

(gdb) bt
#0  0x00007f0af2345cbf in ?? () from /usr/lib64/liblua5.3.so.5
#1  0x00007f0af233765e in lua_setglobal () from /usr/lib64/liblua5.3.so.5
#2  0x00007f0af256aaf7 in _register_local_output_functions (L=0x72fc10) at job_submit_lua.c:1287
#3  _loadscript_extra (st=st@entry=0x72fc10) at job_submit_lua.c:1312
#4  0x00007f0af2570ce5 in slurm_lua_loadscript (L=L@entry=0x7f0af2773310 <L>, plugin=plugin@entry=0x7f0af257175e "job_submit/lua", script_path=0x745a80 "/etc/slurm/job_submit.lua", req_fxns=req_fxns@entry=0x7f0af2773270 <req_fxns>, load_time=load_time@entry=0x7f0af2773318 <lua_script_last_loaded>, local_options=local_options@entry=0x7f0af256aa6d <_loadscript_extra>) at slurm_lua.c:711
#5  0x00007f0af256e859 in init () at job_submit_lua.c:1329
#6  0x00007f0af999240d in plugin_load_from_file (p=p@entry=0x7ffdfc1433e0, fq_path=<optimized out>) at plugin.c:207
#7  0x00007f0af99927d9 in plugin_load_and_link (type_name=<optimized out>, n_syms=n_syms@entry=2, names=names@entry=0x719710 <syms>, ptrs=ptrs@entry=0x74dd50) at plugin.c:267
#8  0x00007f0af99929a6 in plugin_context_create (plugin_type=plugin_type@entry=0x5048d1 "job_submit", uler_type=0x72f190 "job_submit/lua", ptrs=0x74dd50, names=names@entry=0x719710 <syms>, names_size=names_size@entry=16) at plugin.c:425
#9  0x00000000004dd356 in job_submit_g_init (locked=locked@entry=false) at job_submit.c:112
#10 0x00000000004323b0 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:572
Hi, Just for testing purposes, can you please try restarting the controller directly with the slurmctld command? And can you confirm that there are no issues without the Lua script in place? Thanks
Running slurmctld manually still fails.

nid000008:~ # /usr/sbin/slurmctld -D -s
slurmctld: debug: slurmctld log levels: stderr=debug logfile=debug syslog=quiet
slurmctld: debug: Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: error: Configured MailProg is invalid
slurmctld: debug: slurmscriptd: Got ack from slurmctld
slurmctld: select/cons_res: common_init: select/cons_res loaded
slurmctld: select/cons_tres: common_init: select/cons_tres loaded
slurmctld: select/cray_aries: init: Cray/Aries node selection plugin loaded
slurmctld: select/linear: init: Linear node selection plugin loaded with argument 20
slurmctld: debug: Initialization successful
slurmctld: debug: _slurmscriptd_mainloop: started
slurmctld: debug: slurmctld: slurmscriptd fork()'d and initialized.
slurmctld: debug: _slurmctld_listener_thread: started listening to slurmscriptd
slurmctld: slurmctld version 23.02.1 started on cluster sawmill
slurmctld: cred/munge: init: Munge credential signature plugin loaded
slurmctld: accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
slurmctld: select/cons_res: common_init: select/cons_res loaded
slurmctld: select/cons_tres: common_init: select/cons_tres loaded
slurmctld: select/cray_aries: init: Cray/Aries node selection plugin loaded
slurmctld: select/linear: init: Linear node selection plugin loaded with argument 20
slurmctld: preempt/none: init: preempt/none loaded
slurmctld: debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
slurmctld: debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
slurmctld: debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
slurmctld: debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug: _slurmscriptd_mainloop: finished
Segmentation fault

If I comment out JobSubmitPlugins=lua from slurm.conf, slurmctld starts fine.

nid000008:~ # systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2023-05-09 15:58:29 UTC; 55s ago
   Main PID: 113516 (slurmctld)
      Tasks: 11
     CGroup: /system.slice/slurmctld.service
             ├─ 113516 /usr/sbin/slurmctld -D -s
             └─ 113517 "slurmctld: slurmscriptd" "" ""

May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: debug: No backup controllers, not launching heart>
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: No parameter for mcs plugin, default values set
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: mcs: MCSParameters = (null). ondemand set.
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: debug: mcs/none: init: mcs none plugin loaded
May 09 15:58:33 nid000008 slurmctld[113516]: slurmctld: debug: Spawning registration agent for nid[000008>
May 09 15:58:44 nid000008 slurmctld[113516]: slurmctld: agent/is_node_resp: node:nid000008 RPC:REQUEST_NOD>
May 09 15:58:44 nid000008 slurmctld[113516]: slurmctld: agent/is_node_resp: node:nid000009 RPC:REQUEST_NOD>
May 09 15:58:45 nid000008 slurmctld[113516]: slurmctld: error: Nodes nid[000008-000009] not responding
May 09 15:58:59 nid000008 slurmctld[113516]: slurmctld: debug: sched/backfill: _attempt_backfill: beginni>
May 09 15:58:59 nid000008 slurmctld[113516]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs>
Thanks for sharing that. Can you please attach a copy of your slurm.conf? Regards
Created attachment 30187 [details] slurm.conf file
Also, did you follow the setup instructions listed here: https://slurm.schedmd.com/job_submit_plugins.html ? Thanks
We just copied the example job_submit plugin, assuming that the developer of the example plugin followed that guide.
Ok, can you tell me what lua rpm you have installed? Thanks
nid000008:~ # rpm -qa | grep lua
lua-macros-20170611-1.152.noarch
liblua5_3-5-5.3.6-3.6.1.x86_64
liblua5_1-5-5.1.5-2.31.x86_64
lua51-devel-5.1.5-2.31.x86_64
lua51-5.1.5-2.31.x86_64
From what I can gather, you built with the Lua 5.1 headers (lua51-devel is the only Lua devel package installed), but the Lua 5.3 library (liblua5.3.so.5 in your backtrace) is being loaded. You should most likely remove lua51-devel and install lua53-devel. Please let me know if that helps. Thanks
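The check described above can be sketched as a small Python script. It is only a heuristic illustration: the function names (lua_versions, loaded_lua_version, header_mismatch) are hypothetical, the regular expressions assume the SUSE package naming seen in this ticket (liblua5_3-5-..., lua51-devel-...), and an installed devel package does not prove which headers Slurm was actually compiled against.

```python
import re


def lua_versions(rpm_names):
    """Return (runtime, devel) sets of Lua versions found in rpm package names.

    Assumes SUSE-style names as shown in this ticket:
    liblua5_3-5-... for the shared library, lua51-devel-... for the headers.
    """
    runtime, devel = set(), set()
    for name in rpm_names:
        m = re.match(r"liblua(\d)_(\d)-", name)
        if m:
            runtime.add(f"{m.group(1)}.{m.group(2)}")
            continue
        m = re.match(r"lua(\d)(\d)-devel-", name)
        if m:
            devel.add(f"{m.group(1)}.{m.group(2)}")
    return runtime, devel


def loaded_lua_version(backtrace):
    """Extract the Lua version from a liblua path in a gdb backtrace, if any."""
    m = re.search(r"liblua(\d+\.\d+)\.so", backtrace)
    return m.group(1) if m else None


def header_mismatch(rpm_names, backtrace):
    """True if the library seen in the backtrace has no matching devel package."""
    _, devel = lua_versions(rpm_names)
    loaded = loaded_lua_version(backtrace)
    return loaded is not None and loaded not in devel


if __name__ == "__main__":
    # Package list and backtrace frame copied from this ticket.
    packages = [
        "lua-macros-20170611-1.152.noarch",
        "liblua5_3-5-5.3.6-3.6.1.x86_64",
        "liblua5_1-5-5.1.5-2.31.x86_64",
        "lua51-devel-5.1.5-2.31.x86_64",
        "lua51-5.1.5-2.31.x86_64",
    ]
    frame = "#1 0x00007f0af233765e in lua_setglobal () from /usr/lib64/liblua5.3.so.5"
    print("runtime/devel versions:", lua_versions(packages))
    print("loaded library version:", loaded_lua_version(frame))
    print("possible header mismatch:", header_mismatch(packages, frame))
```

Fed the package list and backtrace from this ticket, the script flags the case where the library that crashed (5.3) has no matching devel package installed, which is the situation suspected here.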
Hi After further reflection and analysis, I think you would benefit from installing the latest lua version altogether and making sure the latest development files are installed. If you haven't already done so, try holding off from removing the 5.1 version for now. Thanks
Hi, Did you manage to make any headway regarding this issue? Thanks
The development system where we originally hit this issue is currently busy. Once it's available again we plan on installing lua53 and lua53-devel instead of lua51 and lua51-devel.
Hi, Ok, sounds good. I will keep this ticket open until the issue is resolved, if you'd like. If you foresee this taking some time, we can instead close it and you can re-open it if you experience more issues. Thanks
Installing lua53 and lua53-devel fixed this problem, so I think this can be closed.
Hi, That's great! I'm glad you got it resolved. Have a great day