| Summary: | slurmrestd segfault | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Marco Induni <marco.induni> |
| Component: | slurmrestd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek, nate |
| Version: | 20.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CSCS - Swiss National Supercomputing Centre | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.11.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Marco Induni
2021-04-06 09:34:45 MDT
Marco, I'm looking into the case, but it didn't crash on a simple reproduction attempt. Are you able to get the backtrace out of the core dump?

cheers,
Marcin

Hi Marcin,

> I'm looking into the case, but it didn't crash on a simple reproduction
> attempt. Are you able to get the backtrace out of the core dump?

The segfault doesn't generate any core dump. Below is the output of the program started from gdb, with the bt command issued right after the crash.

Kind regards,
Marco

[root]# gdb /opt/slurm/20.11.4/sbin/slurmrestd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm/20.11.4/sbin/slurmrestd...done.
(gdb) set args -vv 0.0.0.0:6821
(gdb) run
Starting program: /opt/slurm/20.11.4/sbin/slurmrestd -vv 0.0.0.0:6821
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
slurmrestd: debug: slurm_conf_init: using config_file=/opt/slurm/20.11.4/etc/slurm.conf
slurmrestd: debug: Reading slurm.conf file: /opt/slurm/20.11.4/etc/slurm.conf
slurmrestd: debug: Ignoring obsolete CacheGroups option.
slurmrestd: debug: Ignoring obsolete SchedulerPort option.
[New Thread 0x7ffff7fe7700 (LWP 32370)]
[... 18 similar "New Thread" lines elided ...]
[New Thread 0x7ffff54c3700 (LWP 32389)]
slurmrestd: debug: Interactive mode activated (TTY detected on STDIN)
slurmrestd: debug: main: server listen mode activated
slurmrestd: debug: auth/jwt: init: JWT authentication plugin loaded
slurmrestd: debug: parse_http: [[montebar.cscs.ch]:45040] Accepted HTTP connection
slurmrestd: debug: _on_url: [[montebar.cscs.ch]:45040] url path: /slurmdb/v0.0.36/jobs query: (null)
slurmrestd: operations_router: [[montebar.cscs.ch]:45040] POST /slurmdb/v0.0.36/jobs
slurmrestd: debug: rest_auth/local: slurm_rest_auth_p_authenticate: slurm_rest_auth_p_authenticate: [[montebar.cscs.ch]:45040] socket authentication only supported on UNIX sockets
slurmrestd: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
slurmrestd: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to fojorina06:6819: Failed to unpack SLURM_PERSIST_INIT message
slurmrestd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:fojorina07:6819: Connection refused
slurmrestd: error: Sending PersistInit msg: Connection refused
slurmrestd: error: slurm_rest_auth_p_get_db_conn: unable to connect to slurmdbd: Connection refused

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5ac9700 (LWP 32383)]
dbd_conn_send_recv_direct (rpc_version=rpc_version@entry=9216, req=req@entry=0x7ffff5ac8890, resp=resp@entry=0x7ffff5ac8870) at dbd_conn.c:262
262             if (use_conn->fd < 0) {
Missing separate debuginfos, use: debuginfo-install [... package list elided ...]
(gdb) bt
#0  dbd_conn_send_recv_direct (rpc_version=rpc_version@entry=9216, req=req@entry=0x7ffff5ac8890, resp=resp@entry=0x7ffff5ac8870) at dbd_conn.c:262
#1  0x00007ffff116a65b in dbd_conn_send_recv (rpc_version=rpc_version@entry=9216, req=req@entry=0x7ffff5ac8890, resp=resp@entry=0x7ffff5ac8870) at dbd_conn.c:377
#2  0x00007ffff1168ea0 in jobacct_storage_p_get_jobs_cond (db_conn=<optimized out>, uid=<optimized out>, job_cond=<optimized out>) at accounting_storage_slurmdbd.c:2945
#3  0x00007ffff7ada9de in jobacct_storage_g_get_jobs_cond (db_conn=db_conn@entry=0x0, uid=0, job_cond=job_cond@entry=0x0) at slurm_accounting_storage.c:957
#4  0x00007ffff7a697f8 in slurmdb_jobs_get (db_conn=0x0, job_cond=job_cond@entry=0x0) at job_functions.c:70
#5  0x00007ffff4969f09 in db_query_list_funcname (errors=errors@entry=0x7fffd80081a0, auth=auth@entry=0x7fffdc000960, list=list@entry=0x7ffff5ac8978, func=0x7ffff7a697bd <slurmdb_jobs_get>, cond=cond@entry=0x0, func_name=func_name@entry=0x7ffff49700c5 "slurmdb_jobs_get") at api.c:156
#6  0x00007ffff496b1f0 in _dump_jobs (context_id=context_id@entry=0x7fffec002b80 "[montebar.cscs.ch]:45040", method=method@entry=HTTP_REQUEST_POST, parameters=parameters@entry=0x7fffd8000b40, query=query@entry=0x7fffd8006890, tag=tag@entry=0, resp=resp@entry=0x7fffd80062a0, auth=auth@entry=0x7fffdc000960, errors=errors@entry=0x7fffd80081a0, job_cond=job_cond@entry=0x0) at jobs.c:359
#7  0x00007ffff496b41d in op_handler_jobs (context_id=0x7fffec002b80 "[montebar.cscs.ch]:45040", method=HTTP_REQUEST_POST, parameters=parameters@entry=0x7fffd8000b40, query=query@entry=0x7fffd8006890, tag=tag@entry=0, resp=resp@entry=0x7fffd80062a0, auth=0x7fffdc000960) at jobs.c:404
#8  0x000000000040e8d0 in _call_handler (write_mime=MIME_JSON, callback_tag=0, callback=0x7ffff496b300 <op_handler_jobs>, query=0x7fffd8006890, params=0x7fffd8000b40, args=0x7ffff5ac8cb0) at operations.c:371
#9  operations_router (args=0x7ffff5ac8cb0) at operations.c:472
#10 0x000000000040a7a0 in _on_message_complete_request (request=0x7fffdc000990, method=<optimized out>, parser=0x7fffdc0008d0) at http.c:645
#11 _on_message_complete (parser=0x7fffdc0008d0) at http.c:717
#12 0x00007ffff75fd367 in http_parser_execute () from /lib64/libhttp_parser.so.2
#13 0x0000000000409e2f in parse_http (con=0x7fffec0009b0, x=<optimized out>) at http.c:809
#14 0x0000000000406821 in _wrap_on_data (x=0x7fffec0009b0) at conmgr.c:820
#15 0x00000000004059ff in _wrap_work (x=<optimized out>) at conmgr.c:577
#16 0x00007ffff7b651e7 in _worker (arg=0x632530) at workq.c:306
#17 0x00007ffff6cbeea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff69e79fd in clone () from /lib64/libc.so.6

Marco,
I did reproduce the issue. The failure happens because slurmrestd cannot reach slurmdbd (is the connection being filtered by a firewall?). Are you able to run:
>#telnet fojorina07 6819
from the slurmrestd host? Can you confirm that it works as expected once the network connectivity issue is resolved?
Nevertheless, slurmrestd should not segfault. I'll work on the fix and keep you posted on the progress.
cheers,
Marcin
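An aside on the missing core dump Marco mentioned earlier: core files are frequently disabled by the shell's core-size limit, or redirected by the kernel's core_pattern. A hedged sketch for checking both (generic Linux, nothing Slurm-specific; whether the daemon inherits the raised limit depends on how it is started):

```shell
#!/usr/bin/env bash
# Check why a crashing process may leave no core file.

ulimit -c unlimited   # allow core files in this shell (and its children)
ulimit -c             # prints "unlimited" if the limit was raised

# Where the kernel writes cores: a literal path/pattern, or a "|" pipe
# to a collector such as systemd-coredump or abrt.
cat /proc/sys/kernel/core_pattern
```

If core_pattern starts with "|", look in the collector's store (for example coredumpctl) rather than the working directory.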
Marcin,
thank you for your answer and your investigations.
> Are you able to:
> >#telnet fojorina07 6819
> from the slurmrestd host? Can you confirm that it works as expected once the
> network connectivity issue is resolved?
So the slurmrestd host (fojorina06) is also running slurmdbd, and fojorina07 is the backup host. Here are the two relevant lines from slurm.conf:
AccountingStorageHost=fojorina06
AccountingStorageBackupHost=fojorina07
I can connect from the slurmrestd host fojorina06 to itself:
[root@fojorina06 opt]# telnet fojorina06 6819
Trying 148.187.18.51...
Connected to fojorina06.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
but not to the backup host:
[root@fojorina06 opt]# telnet fojorina07 6819
Trying 148.187.18.52...
telnet: connect to address 148.187.18.52: Connection refused
In any case, the connection error message already appears on the first connection attempt to fojorina06:
Apr 6 17:19:30 fojorina06 slurmrestd[9041]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to fojorina06:6819: Failed to unpack SLURM_PERSIST_INIT message
Anyway, I removed AccountingStorageBackupHost from slurm.conf, and now the problem shows up only on the first host (fojorina06).
Here is some additional info:
[root@fojorina06 opt]# host fojorina06
fojorina06.cscs.ch has address 148.187.18.51
[root@fojorina06 opt]# netstat -anop | grep LISTE | grep 6819
tcp 0 0 0.0.0.0:6819 0.0.0.0:* LISTEN 11797/slurmdbd off (0.00/0/0)
Marco
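The telnet checks above can be scripted even where telnet is not installed; a minimal sketch using bash's /dev/tcp pseudo-device. The fojorina hostname from this thread appears only as a commented example; 127.0.0.1 port 1 stands in as a port that is almost always closed:

```shell
#!/usr/bin/env bash
# Probe a TCP port without telnet, using bash's /dev/tcp pseudo-device.
probe() {
    local host=$1 port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        echo "${host}:${port} closed"
    fi
}

probe 127.0.0.1 1        # a port that is almost certainly closed
# probe fojorina07 6819  # the slurmdbd backup host from this thread
```

This distinguishes "connection refused/filtered" from "open" the same way the telnet test does, but is easy to drop into a loop over all AccountingStorage hosts.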
Marco,

Did slurmrestd segfault after the removal of the backup host? Could you please check the slurmdbd logs for the time when the slurmrestd->slurmdbd connection attempt is made? Did you enable AuthAltTypes=auth/jwt in slurmdbd.conf?

cheers,
Marcin

Hi Marcin,

I restarted slurmdbd; below are the log lines that appear as soon as I try the API call:

[2021-04-07T14:44:35.135] error: g_slurm_auth_unpack: remote plugin_id 102 not found
[2021-04-07T14:44:35.135] error: slurm_unpack_received_msg: g_slurm_auth_unpack: REQUEST_PERSIST_INIT has authentication error: No error
[2021-04-07T14:44:35.135] error: slurm_unpack_received_msg: Header lengths are longer than data received
[2021-04-07T14:44:35.145] error: CONN:8 Failed to unpack SLURM_PERSIST_INIT message

Then I added the lines you suggested (they were already present in slurm.conf), and when I restart slurmdbd I get this:

[2021-04-07T14:47:13.498] error: Couldn't load specified plugin name for auth/jwt: Plugin init() callback failed
[2021-04-07T14:47:13.499] error: cannot create auth context for auth/jwt
[2021-04-07T14:47:13.499] fatal: Unable to initialize auth/munge authentication plugin

Thank you,
Marco

>[2021-04-07T14:47:13.498] error: Couldn't load specified plugin name for auth/jwt: Plugin init() callback failed

I think you need:
>AuthAltParameters=jwt_key=/path/to/jwt.key

Am I correct that slurmrestd didn't segfault when you didn't have the backup host (the one that didn't reply) in the config?
cheers,
Marcin

Marcin,

> I think you need:
> >AuthAltParameters=jwt_key=/path/to/jwt.key

I did:

1) openssl genrsa -out /opt/slurm/default/etc/jwt_hs256.key 2048
2) chmod 0600 /opt/slurm/default/etc/jwt_hs256.key
3) Added AuthAltParameters=jwt_key=/opt/slurm/default/etc/jwt_hs256.key to slurm.conf and slurmdbd.conf, which are running on fojorina06 (not the slurmctld controller node)
4) Restarted both slurmdbd and slurmrestd

Here is the tail of the logs:

==> /var/log/slurmdbd.log <==
[2021-04-07T15:33:59.863] error: slurm_auth_verify: jwt_decode failure
[2021-04-07T15:33:59.863] error: slurm_unpack_received_msg: g_slurm_auth_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2021-04-07T15:33:59.863] error: slurm_unpack_received_msg: Protocol authentication error
[2021-04-07T15:33:59.873] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message

==> /var/log/messages <==
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to fojorina06:6819: Failed to unpack SLURM_PERSIST_INIT message
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: Sending PersistInit msg: No error
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: g_slurm_auth_pack: protocol_version 6500 not supported
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: slurm_send_node_msg: g_slurm_auth_pack: REQUEST_PERSIST_INIT has authentication error: Operation now in progress
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: slurm_persist_conn_open: failed to send persistent connection init message to fojorina06:6819
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: Sending PersistInit msg: Protocol authentication error
Apr 7 15:33:59 fojorina06 slurmrestd[11563]: error: DBD_GET_JOBS_COND failure: Unspecified error

> Am I correct that slurmrestd didn't segfault when you didn't have the backup
> host (the one that didn't reply) in the config?
> yes, correct

Marco

Is slurmdbd able to open /opt/slurm/default/etc/jwt_hs256.key? From a permissions standpoint?
Can you restart slurmdbd with the debug2 logging level and check if you see
something like:
>slurmdbd: debug: auth/jwt: init: JWT authentication plugin loaded
>slurmdbd: debug2: AuthAltTypes = auth/jwt
>slurmdbd: debug2: AuthAltParameters = jwt_key=/[....]jwt_hs256.key
cheers,
Marcin
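A side note on the key itself: openssl genrsa (used in an earlier comment) produces an RSA private key, whereas for an HS256 key the Slurm JWT documentation suggests a short random binary key readable only by the Slurm user. A hedged sketch, with an illustrative path:

```shell
#!/usr/bin/env bash
# Sketch: create an HS256 JWT signing key as 32 random bytes,
# roughly as the Slurm JWT guide suggests. The path is illustrative.
key=/tmp/jwt_hs256.key
dd if=/dev/urandom of="$key" bs=32 count=1 status=none
chmod 0600 "$key"
stat -c '%s bytes, mode %a' "$key"
```

The same key file then has to be referenced by AuthAltParameters=jwt_key=... everywhere tokens are issued or verified.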
Marcin,

> Is slurmdbd able to open /opt/slurm/default/etc/jwt_hs256.key? From a
> permissions standpoint?

slurmdbd is started as root, so it should be able to see the key.

> Can you restart slurmdbd with the debug2 logging level and check if you see
> something like:
> >slurmdbd: debug: auth/jwt: init: JWT authentication plugin loaded
> >slurmdbd: debug2: AuthAltTypes = auth/jwt
> >slurmdbd: debug2: AuthAltParameters = jwt_key=/[....]jwt_hs256.key

Sure, here you are:

[2021-04-07T16:12:08.499] debug: Log file re-opened
[2021-04-07T16:12:08.501] debug: auth/munge: init: Munge authentication plugin loaded
[2021-04-07T16:12:08.503] debug: auth/jwt: _init_key: _init_key: Loading key: /opt/slurm/default/etc/jwt_hs256.key
[2021-04-07T16:12:08.504] debug: auth/jwt: init: JWT authentication plugin loaded
[2021-04-07T16:12:08.504] debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurmdbd_tds
[2021-04-07T16:12:08.509] debug2: Attempting to connect to fojorina06.cscs.ch:3306
[2021-04-07T16:12:08.511] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.4.18-MariaDB
[2021-04-07T16:12:08.512] debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 77309411328
[2021-04-07T16:12:08.513] debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 50331648
[2021-04-07T16:12:08.514] debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 900
[2021-04-07T16:12:09.056] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
[2021-04-07T16:12:09.056] debug2: ArchiveDir = /opt/slurm/default/archive
[2021-04-07T16:12:09.056] debug2: ArchiveScript = (null)
[2021-04-07T16:12:09.056] debug2: AuthAltTypes = auth/jwt
[2021-04-07T16:12:09.056] debug2: AuthAltParameters = jwt_key=/opt/slurm/default/etc/jwt_hs256.key
[2021-04-07T16:12:09.056] debug2: AuthInfo = (null)
[2021-04-07T16:12:09.056] debug2: AuthType = auth/munge

but as before, at a first test:

apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to fojorina06:6819: Failed to unpack SLURM_PERSIST_INIT message
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: Sending PersistInit msg: No error
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: g_slurm_auth_pack: protocol_version 6500 not supported
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: slurm_send_node_msg: g_slurm_auth_pack: REQUEST_PERSIST_INIT has authentication error: Operation now in progress
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: slurm_persist_conn_open: failed to send persistent connection init message to fojorina06:6819
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: Sending PersistInit msg: Protocol authentication error
apr 07 16:13:54 fojorina06.cscs.ch slurmrestd[24050]: error: DBD_GET_JOBS_COND failure: Unspecified error

Just to be sure, slurmctld is not involved/needed, right? Because as of now, I haven't configured any JWT auth on the slurmctld controller node.

Cheers,
Marco

Marco,

Do you see any error in the slurmdbd log while the connection is attempted? Can you just run sacct with the SLURM_JWT variable set to the token?
cheers,
Marcin

Marcin,

If I do a simple query from a client with the JWT token configured, it works:

curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X POST 'http://fojorina06.cscs.ch:6821/slurm/v0.0.36/ping' | jq -r '.pings[] | "\(.hostname)/\(.ping)"'
domsl01/UP
dom101/UP

If right after that I run sacct, I get this on the client:

minduni@dom101:~> sacct
sacct: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to fojorina06.cscs.ch:6819: Failed to unpack SLURM_PERSIST_INIT message
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:fojorina07.cscs.ch:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

And here is the corresponding slurmdbd.log:

==> /var/log/slurmdbd.log <==
[2021-04-08T16:54:18.052] error: slurm_auth_verify: jwt_decode failure
[2021-04-08T16:54:18.052] error: slurm_unpack_received_msg: g_slurm_auth_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2021-04-08T16:54:18.052] error: slurm_unpack_received_msg: Protocol authentication error
[2021-04-08T16:54:18.063] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message

Cheers,
Marco

Was the token generated using the same jwt key as the one used in the slurmdbd configuration?

cheers,
Marcin

> Was the token generated using the same jwt key as the one used in slurmdbd
> configuration?

I generated the token directly on the slurmdbd node with:

scontrol token lifespan=90000 username=minduni

Cheers,
Marco

>scontrol token lifespan=90000 username=minduni
This sends a REQUEST_AUTH_TOKEN message to slurmctld, so to get a token that will be accepted by slurmdbd you need the same key on both.
What's the output of scontrol token in your case? I think you mentioned that slurmctld doesn't have auth/jwt enabled, or did you just configure it to test /ping?
cheers,
Marcin
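One way to sanity-check a token from scontrol token is to base64url-decode its middle (payload) segment and inspect the claims, such as the expiry and the username. The sketch below builds a throwaway unsigned token locally so it is self-contained; real tokens have the same three-part header.payload.signature shape, and the sub/exp values here are only examples:

```shell
#!/usr/bin/env bash
# Decode the payload segment of a JWT (header.payload.signature).
# A throwaway unsigned token is built here so the script is self-contained.
b64url() { base64 | tr -d '=\n' | tr '/+' '_-'; }

header=$(printf '{"alg":"HS256","typ":"JWT"}' | b64url)
payload=$(printf '{"sub":"minduni","exp":1617900000}' | b64url)
token="${header}.${payload}.fake-signature"

# Extract the middle segment and re-add the base64 padding stripped by b64url.
mid=${token#*.}; mid=${mid%%.*}
while (( ${#mid} % 4 )); do mid+='='; done

decoded=$(printf '%s' "$mid" | tr '_-' '/+' | base64 -d)
echo "$decoded"
```

Note that decoding only shows the claims; it says nothing about which key signed the token, so a token decoding fine can still fail slurmdbd's jwt_decode verification if the keys differ.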
Hi Marcin,

So finally, with your suggestions, something started working. In the end I added the lines

AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/opt/slurm/default/etc/jwt_hs256.key

to slurm.conf and slurmdbd.conf on the node where slurmdbd and slurmrestd are running, and also to the slurm.conf of the slurmctld node. So a query like this one is working:

curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://fojorina06.cscs.ch:6821/slurmdb/v0.0.36/clusters' | jq -r '.clusters[] | "\(.name)"'
dom
pilatus

However, I tried the jobs endpoints and those seem not to be working as expected, or maybe my queries are not correct. (I left just the query part to keep this readable.)

1) Get job info by id:

GET 'http://fojorina06.cscs.ch:6821/slurmdb/v0.0.36/job/1370778'

The query took a long time and returned a lot of unrelated jobs, including very old ones.

2) Get jobs started on a certain date and belonging to me (minduni):

GET 'http://fojorina06.cscs.ch:6821/slurmdb/v0.0.36/jobs?start_time=2021-04-09&user=minduni'

{
  "meta": {
    "plugin": {
      "type": "openapi/dbv0.0.36",
      "name": "REST DB v0.0.36"
    },
    "Slurm": {
      "version": { "major": 20, "micro": 4, "minor": 11 },
      "release": "20.11.4"
    }
  },
  "errors": [
    {
      "description": "Unknown Query field",
      "error_number": 9000,
      "error": "Query empty or not RFC7320 compliant"
    }
  ]
}

3) Only my jobs:

GET 'http://fojorina06.cscs.ch:6821/slurmdb/v0.0.36/jobs?user=cardo'

(same "Unknown Query field" / "Query empty or not RFC7320 compliant" error response as above)

4) Cluster list (this is working):

GET 'http://fojorina06.cscs.ch:6821/slurmdb/v0.0.36/clusters' | jq -r '.clusters[] | "\(.name)"'
dom
pilatus

Any suggestions? Are my queries wrong, or is something not working as expected?
Cheers,
Marco

Marco,

Do you mind opening a separate case for comment 17? I'd like to make sure that every question/issue gets proper attention; in this ticket we have already merged two things (the segfault and the configuration questions), and I'd like to keep it focused on the slurmrestd segfault.

I can split the tickets by copy&pasting comment 17, but in that case I'll be the reporter there (it's a Bugzilla limitation).

cheers,
Marcin

(In reply to Marcin Stolarek from comment #18)
> Do you mind opening a separate case for comment 17? [...]

Marcin, no problem, I will do it right now.

Cheers,
Marco

Marco,

Do you agree that we can decrease the case severity to 4, as we know that the segfault happens only when we can't connect to slurmdbd?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #20)
> Do you agree that we can decrease the case severity to 4 [...]

Hi Marcin, yes, no problem. Will this end up in a patch, or will it be tied to a new release?

Cheers,
Marco

Marco,
>This will end up on a patch, or will be tied to a new release ?
I'm not sure what you mean; for me, this is clearly a bug. Our normal approach in such a case is to prepare a patch in the bug report that then gets merged into an appropriate minor or major release.
In critical situations we sometimes share the patch with a customer before it is released, so the site can apply the fix locally before our QA and merge decision, or before an official release.
I hope the process is clearer for you now. Let me know if you have any questions.
cheers,
Marcin
(In reply to Marcin Stolarek from comment #22)
> I'm not sure what you mean; for me, this is clearly a bug. [...]

Thank you for the clarification, Marcin.

Cheers,
Marco

Marco,

The reported issue is fixed in the Slurm 20.11.8 release[1].

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/0c320f449bf920a09a4782fd51864d0d7bfa4f06