| Summary: | slurmctld segfault loading example job_submit.lua script | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | benny, mcmullan |
| Version: | 23.02.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | Cray Internal |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file | ||
Hi,
Just for testing purposes, can you please try restarting the controller with the slurmctld command? And can you confirm that without adding the lua script there are no issues?
Thanks

Running slurmctld manually still fails.
nid000008:~ # /usr/sbin/slurmctld -D -s
slurmctld: debug: slurmctld log levels: stderr=debug logfile=debug syslog=quiet
slurmctld: debug: Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: error: Configured MailProg is invalid
slurmctld: debug: slurmscriptd: Got ack from slurmctld
slurmctld: select/cons_res: common_init: select/cons_res loaded
slurmctld: select/cons_tres: common_init: select/cons_tres loaded
slurmctld: select/cray_aries: init: Cray/Aries node selection plugin loaded
slurmctld: select/linear: init: Linear node selection plugin loaded with argument 20
slurmctld: debug: Initialization successful
slurmctld: debug: _slurmscriptd_mainloop: started
slurmctld: debug: slurmctld: slurmscriptd fork()'d and initialized.
slurmctld: debug: _slurmctld_listener_thread: started listening to slurmscriptd
slurmctld: slurmctld version 23.02.1 started on cluster sawmill
slurmctld: cred/munge: init: Munge credential signature plugin loaded
slurmctld: accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
slurmctld: select/cons_res: common_init: select/cons_res loaded
slurmctld: select/cons_tres: common_init: select/cons_tres loaded
slurmctld: select/cray_aries: init: Cray/Aries node selection plugin loaded
slurmctld: select/linear: init: Linear node selection plugin loaded with argument 20
slurmctld: preempt/none: init: preempt/none loaded
slurmctld: debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
slurmctld: debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
slurmctld: debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
slurmctld: debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug: _slurmscriptd_mainloop: finished
Segmentation fault
If I comment out JobSubmitPlugins=lua from slurm.conf, slurmctld starts fine.
nid000008:~ # systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2023-05-09 15:58:29 UTC; 55s ago
Main PID: 113516 (slurmctld)
Tasks: 11
CGroup: /system.slice/slurmctld.service
├─ 113516 /usr/sbin/slurmctld -D -s
└─ 113517 "slurmctld: slurmscriptd" "" ""
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: debug: No backup controllers, not launching heart>
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: No parameter for mcs plugin, default values set
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: mcs: MCSParameters = (null). ondemand set.
May 09 15:58:29 nid000008 slurmctld[113516]: slurmctld: debug: mcs/none: init: mcs none plugin loaded
May 09 15:58:33 nid000008 slurmctld[113516]: slurmctld: debug: Spawning registration agent for nid[000008>
May 09 15:58:44 nid000008 slurmctld[113516]: slurmctld: agent/is_node_resp: node:nid000008 RPC:REQUEST_NOD>
May 09 15:58:44 nid000008 slurmctld[113516]: slurmctld: agent/is_node_resp: node:nid000009 RPC:REQUEST_NOD>
May 09 15:58:45 nid000008 slurmctld[113516]: slurmctld: error: Nodes nid[000008-000009] not responding
May 09 15:58:59 nid000008 slurmctld[113516]: slurmctld: debug: sched/backfill: _attempt_backfill: beginni>
May 09 15:58:59 nid000008 slurmctld[113516]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs>
Thanks for sharing that. Can you please attach a copy of your slurm.conf?
Regards

Created attachment 30187
slurm.conf file
Also, did you follow the setup instructions listed here: https://slurm.schedmd.com/job_submit_plugins.html ?
Thanks

We just copied the example job_submit plugin, assuming that the developer of the example plugin followed that guide.

Ok, can you tell me what lua rpm you have installed?
Thanks

nid000008:~ # rpm -qa | grep lua
lua-macros-20170611-1.152.noarch
liblua5_3-5-5.3.6-3.6.1.x86_64
liblua5_1-5-5.1.5-2.31.x86_64
lua51-devel-5.1.5-2.31.x86_64
lua51-5.1.5-2.31.x86_64

From what I can gather, you built with lua 5.1 headers, but the lua 5.3 library is being loaded. You should most likely remove lua5.1-devel and install lua5.3-devel. Please let me know if that helps.
Thanks

Hi,
After further reflection and analysis, I think you would benefit from installing the latest lua version altogether and making sure the latest development files are installed. If you haven't already done so, try holding off from removing the 5.1 version for now.
Thanks

Hi,
Did you manage to make any headway regarding this issue?
Thanks

The development system where we originally hit this issue is currently busy. Once it's available again we plan on installing lua53 and lua53-devel instead of lua51 and lua51-devel.

Hi,
Ok, sounds good. I will keep this ticket open if you'd like until the issue is resolved. Unless you foresee this taking some time; in that case we can close and you can re-open if you experience more issues.
Thanks

Installing lua53 and lua53-devel fixed this problem, so I think this can be closed.

Hi,
That's great! I'm glad you got it resolved. Have a great day
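The diagnosis in this thread (built against Lua 5.1 headers while the Lua 5.3 library is loaded) can be spotted from the package list alone. The following is a minimal sketch, not a Slurm or SUSE tool: `lua_series_mismatch` is a hypothetical helper name, and it assumes the SLES-style package naming seen in this ticket (`luaXY-devel-...` for headers, `liblua5_Y-...` for runtime libraries). It only flags runtime libraries that have no matching devel package; it cannot tell which headers a given Slurm build actually used.

```shell
# Hypothetical helper: read package names on stdin (one per line) and report
# every Lua runtime series that has no matching -devel (header) package.
# Assumes SLES-style names such as lua51-devel-... and liblua5_3-5-...
lua_series_mismatch() {
    pkgs=$(cat)
    # lua51-devel-5.1.5-...  ->  5.1
    devel=$(printf '%s\n' "$pkgs" |
        sed -n 's/^lua\([0-9]\)\([0-9]\)-devel-.*/\1.\2/p' | sort -u)
    # liblua5_3-5-5.3.6-...  ->  5.3
    runtime=$(printf '%s\n' "$pkgs" |
        sed -n 's/^liblua\([0-9]\)_\([0-9]\)-.*/\1.\2/p' | sort -u)
    for series in $runtime; do
        # Exact, whole-line match against the list of devel series.
        if ! printf '%s\n' "$devel" | grep -qxF "$series"; then
            echo "liblua $series is installed without matching lua-devel headers"
        fi
    done
}
```

On an RPM-based system it could be fed with `rpm -qa | grep lua | lua_series_mismatch`. With the package list posted above, it would flag the 5.3 runtime, which is consistent with the fix that eventually worked (installing lua53 and lua53-devel).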
On Slurm 23.02.1, slurmctld segfaults when loading the example job_submit.lua script on SLES 15 SP4. We installed the slurm-example-configs RPM, copied /etc/slurm/job_submit.lua.example to /etc/slurm/job_submit.lua, and set JobSubmitPlugins=lua in slurm.conf. Then we get this failure from slurmctld:
nid000008:~ # systemctl status slurmctld
× slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
Active: failed (Result: signal) since Mon 2023-05-08 20:06:04 UTC; 9min ago
Process: 129884 ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=killed, signal=SEGV)
Main PID: 129884 (code=killed, signal=SEGV)
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: select/linear: init: Linear node selection plugin >
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: preempt/none: init: preempt/none loaded
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_energy/none: init: AcctGatherE>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_profile/none: init: AcctGather>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_interconnect/none: init: AcctG>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: acct_gather_filesystem/none: init: AcctGat>
May 08 20:06:04 nid000008 slurmctld[129884]: slurmctld: debug: jobacct_gather/none: init: Job accounting >
May 08 20:06:04 nid000008 slurmctld[129885]: slurmctld: debug: _slurmscriptd_mainloop: finished
May 08 20:06:04 nid000008 systemd[1]: slurmctld.service: Main process exited, code=killed, status=11/SEGV
May 08 20:06:04 nid000008 systemd[1]: slurmctld.service: Failed with result 'signal'.
(gdb) bt
#0 0x00007f0af2345cbf in ?? () from /usr/lib64/liblua5.3.so.5
#1 0x00007f0af233765e in lua_setglobal () from /usr/lib64/liblua5.3.so.5
#2 0x00007f0af256aaf7 in _register_local_output_functions (L=0x72fc10) at job_submit_lua.c:1287
#3 _loadscript_extra (st=st@entry=0x72fc10) at job_submit_lua.c:1312
#4 0x00007f0af2570ce5 in slurm_lua_loadscript (L=L@entry=0x7f0af2773310 <L>, plugin=plugin@entry=0x7f0af257175e "job_submit/lua", script_path=0x745a80 "/etc/slurm/job_submit.lua", req_fxns=req_fxns@entry=0x7f0af2773270 <req_fxns>, load_time=load_time@entry=0x7f0af2773318 <lua_script_last_loaded>, local_options=local_options@entry=0x7f0af256aa6d <_loadscript_extra>) at slurm_lua.c:711
#5 0x00007f0af256e859 in init () at job_submit_lua.c:1329
#6 0x00007f0af999240d in plugin_load_from_file (p=p@entry=0x7ffdfc1433e0, fq_path=<optimized out>) at plugin.c:207
#7 0x00007f0af99927d9 in plugin_load_and_link (type_name=<optimized out>, n_syms=n_syms@entry=2, names=names@entry=0x719710 <syms>, ptrs=ptrs@entry=0x74dd50) at plugin.c:267
#8 0x00007f0af99929a6 in plugin_context_create (plugin_type=plugin_type@entry=0x5048d1 "job_submit", uler_type=0x72f190 "job_submit/lua", ptrs=0x74dd50, names=names@entry=0x719710 <syms>, names_size=names_size@entry=16) at plugin.c:425
#9 0x00000000004dd356 in job_submit_g_init (locked=locked@entry=false) at job_submit.c:112
#10 0x00000000004323b0 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:572
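The backtrace itself shows which Lua library was loaded at the time of the crash: the top frames come from a `liblua5.3` shared object. As a rough sketch, the shared objects named in a saved gdb backtrace can be listed like this. `bt_shared_objects` is a hypothetical helper name, and it assumes the usual gdb frame format in which library frames end with `from <path>`.

```shell
# Hypothetical helper: read a gdb backtrace on stdin and print each distinct
# shared object that appears in a "from <path>" frame suffix.
bt_shared_objects() {
    grep -o 'from /[^ ]*' |   # keep only the "from /path/to/lib.so" fragments
        cut -d' ' -f2 |       # drop the leading "from"
        sort -u               # one line per distinct library
}
```

Comparing that list with the Lua devel package present at build time gives the same hint as the thread's diagnosis. Running `ldd` on the installed job_submit lua plugin is another way to see which liblua it resolves to; the plugin's path depends on the installation and is not given in this ticket.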