Ticket 14446

Summary: slurmrestd crashes and uses 100% of CPU
Product: Slurm Reporter: GSK-ONYX-SLURM <slurm-support>
Component: slurmrestd Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: nate
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=19268
Site: GSK Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: CLE Version:
Version Fixed: 23.02.2, 23.11.0rc1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: test patch for GSK only
core.2123 archive
the gdb output 20230315
the gdb output 20230315-v1
the gdb output 20230316

Description GSK-ONYX-SLURM 2022-06-30 04:22:05 MDT
Hi Team,

From time to time we receive an alert about slurmrestd using 100% of CPU. The API is no longer working at that time; restarting the daemon helps. So, we decided to add a Restart=on-failure directive (see below) to the service file. So far so good, however this is a workaround, and we would expect a permanent solution.

Have you seen this before?

This is a service file:

[Unit]
Description=Slurm REST daemon
After=munge.service network.target remote-fs.target network-online.target autofs.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Restart=on-failure
RestartSec=5s
User=slurmrestapi
Group=slurmrestapi
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
Environment="SLURM_JWT=daemon"
ExecStart=/home/slurm/Software/21.08.8-2-1/sbin/slurmrestd $SLURMRESTD_OPTIONS 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

The logs when it's happening:

2022-03-11T03:32:22.510925-05:00 us1sxlx00338 slurmrestd[2464]: error: _handle_poll_event: [[us4ndvs007.corpnet1.com]:46590] poll error: Unexpected missing socket error
2022-03-11T03:32:22.511058-05:00 us1sxlx00338 slurmrestd[2464]: error: _handle_poll_event: [[us4ndvs007.corpnet1.com]:46590] poll error: Unexpected missing socket error

Thanks in advance for your support.
Radek
Comment 1 Nate Rini 2022-06-30 09:15:05 MDT
(In reply to GSK-EIS-SLURM from comment #0)
> [Service]
> Restart=on-failure
> RestartSec=5s
We generally don't suggest this as it could lead to lost updates. In this case, it appears to be an effective enough workaround.
 
> 2022-03-11T03:32:22.510925-05:00 us1sxlx00338 slurmrestd[2464]: error:
> _handle_poll_event: [[us4ndvs007.corpnet1.com]:46590] poll error: Unexpected
> missing socket error

This is a fail-safe error where the socket disappeared, lost its state, or had its error overwritten.
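For context, the usual way a poll loop queries a socket's pending error is `getsockopt(SO_ERROR)`. The sketch below is a hypothetical illustration of that pattern, not Slurm's actual code; the function name `get_socket_error` is invented. It shows the fail-safe case described above: the query itself can fail when the descriptor is already gone, leaving no valid socket error to report.

```c
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical sketch (not Slurm's code): ask the kernel for the
 * socket's pending error via SO_ERROR. If getsockopt() itself fails,
 * e.g. with EBADF because the fd was already closed, there is no
 * valid socket error state left -- the "missing socket error" case. */
static int get_socket_error(int fd, int *sock_err)
{
	socklen_t len = sizeof(*sock_err);

	*sock_err = 0;
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, sock_err, &len) < 0)
		return errno; /* the query failed; e.g. EBADF */
	return 0; /* *sock_err now holds the pending error, or 0 */
}
```

Calling this on a closed or invalid descriptor returns EBADF rather than a socket error, which matches the later "fd_get_socket_error failed Bad file descriptor" messages in this ticket.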

> From time to time we receive the alert about using 100% of CPU by the
> slurmrestd. The API is no longer working that time. Restarting the daemon
> helps.

Have any cores been taken while it's in this state? Please attach the full slurmrestd logs if possible. Please also check dmesg for errors around these times.
Comment 3 Nate Rini 2022-06-30 09:36:05 MDT
Created attachment 25716 [details]
test patch for GSK only

(In reply to Nate Rini from comment #1)
> (In reply to GSK-EIS-SLURM from comment #0)
> > 2022-03-11T03:32:22.510925-05:00 us1sxlx00338 slurmrestd[2464]: error:
> > _handle_poll_event: [[us4ndvs007.corpnet1.com]:46590] poll error: Unexpected
> > missing socket error
> 
> This is a fail-safe error where the socket disappeared, lost its state, or
> had its error overwritten.

Please also note that this error message was improved by bug#13050 in the slurm-22.05 release.

I have attached a ported copy of the patch for you to apply to your system, to see if we can get a better error message. Please note that we generally only apply improvements to the current master branch, not to tagged releases. Your site is also welcome to upgrade to 22.05 to get this patch.
Comment 4 GSK-ONYX-SLURM 2022-07-03 22:59:21 MDT
(In reply to Nate Rini from comment #1)
> This is a fail-safe error where the socket disappeared, lost its state, or
> had its error overwritten.
> 
> > From time to time we receive the alert about using 100% of CPU by the
> > slurmrestd. The API is no longer working that time. Restarting the daemon
> > helps.
> 
> Have any cores been taken while it's in this state? Please attach the full
> slurmrestd logs if possible. Please also check dmesg for errors around these
> times.

No, nothing has changed since the VM was provisioned.

I took a look at the logs, but unfortunately they have been rotated, so I cannot paste anything else. The issue happens occasionally; the last time was at the beginning of June, if I'm not mistaken. It is not limited to one cluster: most of our clusters have been affected by this error, even dev/test environments, which are quiet and largely unused.

There was no such issue under the 20.x versions, where it was possible to run the daemon as the root user. I don't know if that is related, but I guess it's worth mentioning. At least, I can't remember this issue being reported back then.
Comment 5 GSK-ONYX-SLURM 2022-07-03 23:19:06 MDT
(In reply to Nate Rini from comment #3)
> Please also note that this error has been improved by bug#13050 in
> slurm-22.05 release.

The workaround I mentioned in comment 0 has been applied across all the clusters, so if it works and the service is restarted every time the issue happens, we won't see it anymore. The restarts should still be visible in the logs, so reviewing them from time to time may help to find something.
 
> I have attached a ported copy of the patch to apply to your system to see if
> we can get a better error. Please note that we generally only apply
> improvements to the current master branch and not tagged releases. Your site
> is also welcome to upgrade to 22.05 to get this patch.

Thanks a lot. I plan to start upgrading the clusters at the beginning of August.

I haven't updated the service files on our sandboxes yet, so I will keep an eye on those clusters to see if I can catch the error there.

Thanks Nate!
Comment 6 GSK-ONYX-SLURM 2022-07-07 06:58:28 MDT
Hi Nate,
let's close the ticket. Once Slurm is upgraded to 22.05.x, I will come back to this and re-open the ticket if needed.

Thanks,
Radek
Comment 7 Nate Rini 2022-07-11 14:46:26 MDT
(In reply to GSK-EIS-SLURM from comment #6)
> Hi Nate,
> let's close the ticket. Once the Slurm is upgraded to the 22.05.x version I
> will back to this and re-open the ticket if needed. 

Closing ticket per last response. Please re-open if the issue persists after the upgrade.
Comment 8 GSK-ONYX-SLURM 2023-01-23 00:22:46 MST
Dear SchedMD Team,

I am reopening this ticket, as the issue still persists. I tried to capture as many details as I could, but to be honest the error output hasn't improved since last time; I don't see any additional details in the logs to capture.

slurmrestd runs as a systemd service under the slurmrestapi user. The systemd unit file has not changed since last time and can be seen in comment 0. The only thing that has changed is the ExecStart= path, as we upgraded Slurm to version 22.05.2. That didn't help.

Below you can see the output taken from top:

Tasks: 236 total,   2 running, 234 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.4 us, 63.8 sy,  0.0 ni,  2.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3861252 total,   177008 free,   570128 used,  3114116 buff/cache
KiB Swap:  8388604 total,  8370932 free,    17672 used.  2808692 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 78725 slurmre+  20   0 1094764   7756   2680 S 103.3  0.2   1477:21 slurmrestd
   517 root      20   0   63592  24108  23944 R  89.7  0.6   3087:07 systemd-journal
  1867 root      20   0  214104   5848   3544 S   0.3  0.2  67:21.43 vmtoolsd
 18384 root      20   0  439592 124360   6716 S   0.3  3.2 291:02.74 splunkd
 20850 root      20   0       0      0      0 S   0.3  0.0   0:00.46 kworker/0:3
 26038 root      20   0  168564   2596   1712 R   0.3  0.1   0:00.02 top
     1 root      20   0  125884   3912   2308 S   0.0  0.1  33:49.49 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:02.04 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   5:37.99 ksoftirqd/0


/var/log/messages - the lines keep repeating:

2023-01-22T04:19:11.667795-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.668044-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.668295-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.668554-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.668790-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.669029-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.669285-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor

The systemd status:

[root@us1sxlx00179 ~]# systemctl -l status slurmrestd.service
● slurmrestd.service - Slurm REST daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmrestd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-22 14:08:35 EST; 1 months 1 days ago
 Main PID: 78725 (slurmrestd)
   CGroup: /system.slice/slurmrestd.service
           └─78725 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf

Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
[root@us1sxlx00179 ~]#


I'm not allowed to see the https://bugs.schedmd.com/show_bug.cgi?id=13050 bug.

Any thoughts?

Thanks,
Radek
Comment 9 GSK-ONYX-SLURM 2023-01-23 00:23:48 MST
Reopening the ticket.
Comment 10 GSK-ONYX-SLURM 2023-01-26 03:52:19 MST
Hi Nate,

Have you had a chance to look at the ticket and the update I added?

Thanks,
Radek
Comment 11 Nate Rini 2023-01-26 07:49:33 MST
(In reply to GSK-EIS-SLURM from comment #10)
> Have you had a chance to look at the ticket and the update I added?

Looking at it now.
Comment 13 Nate Rini 2023-01-26 08:03:49 MST
(In reply to GSK-EIS-SLURM from comment #8)
> I'm not allowed to see the https://bugs.schedmd.com/show_bug.cgi?id=13050
> bug.

This bug resulted in these changes being pushed upstream:
> *   dddeb9f0ff (HEAD -> master, origin/master, origin/HEAD) Merge branch 'bug13050'
> |\  
> | * da1147c05b (bug13050) Check return code of fd_get_socket_error() and don't use errno
> | * c688b2a1c2 check return code of fd_get_socket_error
> | * fa8f37147d We shouldn't pass &errno to getsockopt
> | * 35c9236d50 fd_get_socket_error - use SLURM_COMMUNICATIONS_MISSING_SOCKET_ERROR
> |/

Looks like this is an unrelated issue.
Comment 14 Nate Rini 2023-01-26 08:07:45 MST
(In reply to GSK-EIS-SLURM from comment #8)
> Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor

This error should have caused the connection to be closed, yet `[[us1ndvs21.corpnet1.com]:46232]` continues dumping errors. Please start slurmrestd with the following environment variables set:
> SLURM_DEBUG_FLAGS=net
> SLURMRESTD_DEBUG=4

It will dump a lot of logs. Please just run it until `poll error` errors start and then stop it. Please attach the logs and revert the env change for normal production.
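One way to apply these variables is through the EnvironmentFile= the unit file in comment 0 already loads; the exact file contents below are a sketch and site-specific:

```ini
# /etc/sysconfig/slurmrestd (loaded via EnvironmentFile= in the unit file)
SLURM_DEBUG_FLAGS=net
SLURMRESTD_DEBUG=4
```

Then restart slurmrestd, and remove the lines again once the logs have been captured.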

I assume slurmrestd is able to handle normal communications even though it is using 100% of a CPU?
Comment 15 GSK-ONYX-SLURM 2023-01-26 08:44:15 MST
(In reply to Nate Rini from comment #14)

> This error should have caused the connection to be closed yet
> `[[us1ndvs21.corpnet1.com]:46232]` continues dumping errors. Please activate
> slurmrestd with the following environmental variables:
> > SLURM_DEBUG_FLAGS=net
> > SLURMRESTD_DEBUG=4
> 
> It will dump a lot of logs. Please just run it until `poll error` errors
> start and then stop it. Please attach the logs and revert the env change for
> normal production.

Sure, will do that, but it may take some time until the issue appears again. 

> 
> I assume slurmrestd is able to handle normal communications even though it
> is using 100% of a CPU?

HPC users haven't reported anything related to slurmrestd while it's using 100% of CPU, but I will check that and confirm.

I cannot tell when it will happen again, so perhaps you'd like to close the ticket, and I will reopen it if needed. What's your take on that?

Cheers,
Radek
Comment 16 Nate Rini 2023-01-26 10:11:24 MST
(In reply to GSK-EIS-SLURM from comment #15)
> (In reply to Nate Rini from comment #14)
> > I assume slurmrestd is able to handle normal communications even though it
> > is using 100% of a CPU?
> 
> HPC users haven't reported anything related to slurmrestd while it's using
> 100% of CPU, but I will check that and confirm.

Good, then the design of slurmrestd is still working as intended.

> I cannot tell you when it happens again, so perhaps you'd like to close the
> ticket and I will reopen it if needed. What's your take on that?

I'll tag the ticket as timed out. The first reply will automatically re-open it and we can analyze the logs then.
Comment 17 GSK-ONYX-SLURM 2023-02-08 23:27:22 MST
(In reply to Nate Rini from comment #14)

> It will dump a lot of logs. Please just run it until `poll error` errors
> start and then stop it. Please attach the logs and revert the env change for
> normal production.

The issue happened again. This is what I was able to capture once I had added the flags to the env file:

2023-02-07T20:52:22.688441-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43348] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:22.688749-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43348] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:22.689015-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43348] on_data returned rc: Operation not permitted
2023-02-07T20:52:22.689279-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43348] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.710686-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43734] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.710950-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43734] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.711197-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43734] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.711431-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43734] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.728043-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43742] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.728301-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43742] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.728520-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43742] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.728740-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43742] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.751721-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43748] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.751953-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43748] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.752222-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43748] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.753232-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.753460-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.753695-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.753911-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.754146-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.754359-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.754574-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.754788-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
[...]

> 
> I assume slurmrestd is able to handle normal communications even though it
> is using 100% of a CPU?

Yes, I did check it and I was able to submit a job using the Slurm API.

Cheers,
Radek
Comment 18 Nate Rini 2023-02-10 17:28:33 MST
(In reply to GSK-EIS-SLURM from comment #17)
> > I assume slurmrestd is able to handle normal communications even though it
> > is using 100% of a CPU?
> 
> Yes, I did check it and I was able to submit a job using the Slurm API.

Was a core taken (using gcore) while it was in this state?
Comment 19 GSK-ONYX-SLURM 2023-02-12 23:41:00 MST
(In reply to Nate Rini from comment #18)

> 
> Was a core taken (using gcore) while it was in this state?

Nope, I wasn't asked to do that. Shall I do it the next time the issue occurs?
Comment 20 Nate Rini 2023-02-13 11:30:08 MST
(In reply to GSK-EIS-SLURM from comment #19)
> (In reply to Nate Rini from comment #18)
> 
> > 
> > Was a core taken (using gcore) while it was in this state?
> 
> Nope, I wasn't asked to do that. Shall I do this next time when the issue
> occurs again?

Yes, please grab a core when it is having the issue. The logs presented so far don't provide enough information to determine the cause here.
Comment 21 Nate Rini 2023-03-01 12:17:12 MST
I'm going to mark this ticket as timed out. Once there is a core, please respond to this ticket, and we can debug the issue with the core. Until then, there isn't much that can be done on our side.
Comment 22 GSK-ONYX-SLURM 2023-03-14 02:42:50 MDT
Hi Nate / SchedMD Team,

I was able to capture a core with gcore; see the attached file.

[root@uk1sxlx00213 ~]# systemctl -l status slurmrestd.service
● slurmrestd.service - Slurm REST daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmrestd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-02-26 17:12:22 GMT; 2 weeks 1 days ago
 Main PID: 2123 (slurmrestd)
   CGroup: /system.slice/slurmrestd.service
           └─2123 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf

Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]#

The uk2ndvs009 device that keeps repeating in the logs is a vulnerability scanner; it has the Tenable software installed.

[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]# ps -ef | grep slurmrestd
slurmre+   2123      1  5 Feb26 ?        22:28:46 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf
root      64588  64206  0 08:28 pts/0    00:00:00 grep slurmrestd
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]# gcore 2123
[New LWP 2184]
[New LWP 2183]
[New LWP 2182]
[New LWP 2181]
[New LWP 2180]
[New LWP 2179]
[New LWP 2178]
[New LWP 2177]
[New LWP 2176]
[New LWP 2175]
[New LWP 2174]
[New LWP 2173]
[New LWP 2172]
[New LWP 2171]
[New LWP 2170]
[New LWP 2169]
[New LWP 2168]
[New LWP 2167]
[New LWP 2166]
[New LWP 2165]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
warning: target file /proc/2123/cmdline contained unexpected null characters
Saved corefile core.2123
[Inferior 1 (process 2123) detached]
[root@uk1sxlx00213 ~]#

Let me know if there's anything else you need.

Thanks,
Radek
Comment 23 GSK-ONYX-SLURM 2023-03-14 02:43:47 MDT
Created attachment 29311 [details]
core.2123 archive
Comment 24 Nate Rini 2023-03-14 08:16:46 MDT
(In reply to GSK-EIS-SLURM from comment #23)
> Created attachment 29311 [details]
> core.2123 archive

Please keep the core locally. It is only helpful with all of the loaded libraries from the time of dumping.
Comment 25 Nate Rini 2023-03-14 08:19:22 MDT
(In reply to GSK-EIS-SLURM from comment #22)
> Let me know if there's anything else you need.

Please call this using gdb
> gdb $(which slurmrestd) core.2123 
>> set pagination off
>> set print pretty on
>> t a a bt full

Once I have that, I will likely need the output of a few more gdb commands.
Comment 26 GSK-ONYX-SLURM 2023-03-15 06:23:38 MDT
(In reply to Nate Rini from comment #24)

> Please keep the core locally. It is only helpful with all of the loaded
> libraries from the time of dumping.

My apologies, I've never used gcore before... 


(In reply to Nate Rini from comment #25)

> Please call this using gdb
> > gdb $(which slurmrestd) core.2123 
> >> set pagination off
> >> set print pretty on
> >> t a a bt full
> 
> Once I have that, I will likely need the output of a few more gdb commands.

The txt file is attached. I'm not sure if I executed everything as you requested; please let me know if not.

Once we have the core dump output, can I restart the slurmrestd service so it stops consuming 100% of CPU, or is something else still needed?

Radek
Comment 27 GSK-ONYX-SLURM 2023-03-15 06:24:19 MDT
Created attachment 29330 [details]
the gdb output 20230315
Comment 28 Nate Rini 2023-03-15 08:47:22 MDT
Looks like slurmrestd is not in your user's path:
> [root@uk1sxlx00213 tmp]# gdb $(which slurmrestd) core.2123
> which: no slurmrestd in (/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin)

Can you please redo the procedure but place the path for slurmrestd as the second arg?

> # gdb $PATH_TO_SLURMRESTD core.2123

Other than that, everything else looked correct.
Comment 29 GSK-ONYX-SLURM 2023-03-15 11:15:27 MDT
(In reply to Nate Rini from comment #28)
> Looks like slurmrestd is not in your user's path:
> > [root@uk1sxlx00213 tmp]# gdb $(which slurmrestd) core.2123
> > which: no slurmrestd in (/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin)
> 
> Can you please redo the procedure but place the path for slurmrestd as the
> second arg?

Oops, I overlooked that, apologies again. File attached.

Thanks,
Radek
Comment 30 GSK-ONYX-SLURM 2023-03-15 11:16:20 MDT
Created attachment 29340 [details]
the gdb output 20230315-v1
Comment 32 Nate Rini 2023-03-15 12:31:48 MDT
Please call this in gdb same as before:
> set pagination off
> set print pretty on
> t 21
> f 1
> p *mgr
Comment 33 GSK-ONYX-SLURM 2023-03-16 00:04:40 MDT
Created attachment 29350 [details]
the gdb output 20230316
Comment 36 Nate Rini 2023-03-16 07:59:56 MDT
Please call this in gdb same as before:
> set pagination off
> set print pretty on
> t 21
> f 1
> p *(con_mgr_fd_t *) mgr->connections->head->data
Comment 37 GSK-ONYX-SLURM 2023-03-17 00:42:22 MDT
(In reply to Nate Rini from comment #36)
> Please call this in gdb same as before:
> > set pagination off
> > set print pretty on
> > t 21
> > f 1
> > p *(con_mgr_fd_t *) mgr->connections->head->data

Here you are:

[root@uk1sxlx00213 tmp]# gdb /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd core.2123
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd...done.
[New LWP 2165]
[New LWP 2166]
[New LWP 2167]
[New LWP 2168]
[New LWP 2169]
[New LWP 2170]
[New LWP 2171]
[New LWP 2172]
[New LWP 2173]
[New LWP 2174]
[New LWP 2175]
[New LWP 2176]
[New LWP 2177]
[New LWP 2178]
[New LWP 2179]
[New LWP 2180]
[New LWP 2181]
[New LWP 2182]
[New LWP 2183]
[New LWP 2184]
[New LWP 2123]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd'.
#0  0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64 http-parser-2.7.1-9.el7.x86_64 json-c-0.11-4.el7_0.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-55.el7_9.x86_64 libcom_err-1.42.9-19.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-25.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-20.el7_9.x86_64
(gdb) set pagination off
(gdb) set print pretty on
(gdb) t 21
[Switching to thread 21 (Thread 0x7f945d1e3740 (LWP 2123))]
#0  0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) f 1
#1  0x00007f945cc85646 in _watch (mgr=0x21eca10) at conmgr.c:1624
1624    conmgr.c: No such file or directory.
(gdb) p *(con_mgr_fd_t *) mgr->connections->head->data
$1 = {
  magic = -768326417,
  input_fd = -1,
  output_fd = 8,
  arg = 0x7f94380030e0,
  name = 0x7f9438002b70 "[uk2ndvs009.corpnet1.com]:41268",
  events = {
    on_connection = 0x405822 <_setup_http_context>,
    on_data = 0x404374 <parse_http>,
    on_finish = 0x40484c <on_http_connection_finish>
  },
  in = 0x7f9438000950,
  on_data_tried = false,
  out = 0x7f9438002b30,
  is_socket = true,
  unix_socket = 0x0,
  is_listen = false,
  can_write = false,
  can_read = false,
  read_eof = true,
  is_connected = true,
  has_work = false,
  work = 0x7f9438000a70,
  mgr = 0x21eca10
}
(gdb) quit
[root@uk1sxlx00213 tmp]#
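The struct above captures the suspected failure state: input_fd is already -1 and read_eof is true, yet the connection object is still registered with the manager, so every poll pass re-queries a dead descriptor and logs the same error without making progress. Below is a toy model of that busy loop; it is a hypothetical sketch, with struct and function names that are illustrative, not Slurm's actual conmgr code.

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Toy model of the state in the core dump above: input_fd is already -1
 * but the connection is still in the manager's poll list.
 * Names are illustrative, not Slurm's. */
struct toy_con {
	int input_fd;
	int errors_logged;
};

/* One poll pass: querying SO_ERROR on the dead fd fails with EBADF,
 * the error is logged, but the connection is never removed -- so the
 * next pass immediately repeats it, pinning one CPU at 100%. */
static bool toy_poll_once(struct toy_con *con)
{
	int err;
	socklen_t len = sizeof(err);

	if (getsockopt(con->input_fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0) {
		con->errors_logged++; /* "fd_get_socket_error failed ..." */
		return false;         /* no progress; fd stays registered */
	}
	return true;
}
```

Under this model, removing (or closing out) the connection when the error query fails would break the loop, which is consistent with the corrective patch mentioned in the following comments.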
Comment 39 Nate Rini 2023-03-17 09:59:33 MDT
That provided the information I needed. Working on a corrective patch.
Comment 46 GSK-ONYX-SLURM 2023-03-27 01:23:12 MDT
(In reply to Nate Rini from comment #39)
> That provided the information I needed. Working on a corrective patch.

Hi Nate, when do you think the patch will be ready? Will it also be included in the latest Slurm, for instance in 23.02.1?

Thanks,
Radek
Comment 48 Nate Rini 2023-03-27 11:07:39 MDT
(In reply to GSK-EIS-SLURM from comment #46)
> (In reply to Nate Rini from comment #39)
> > That provided the information I needed. Working on a corrective patch.
> 
> Hi Nate, when do you think the patch will be ready? Will it also be
> implemented to the latest Slurm, for instance to 23.02.1..? 

During QA testing of the patch, we found an issue, so the fix most likely won't make it in for the 23.02.1 release but should make 23.02.2 release.
Comment 67 Nate Rini 2023-04-11 08:25:07 MDT
(In reply to Nate Rini from comment #48)
> (In reply to GSK-EIS-SLURM from comment #46)
> > (In reply to Nate Rini from comment #39)
> > > That provided the information I needed. Working on a corrective patch.
> > 
> > Hi Nate, when do you think the patch will be ready? Will it also be
> > implemented to the latest Slurm, for instance to 23.02.1..? 
> 
> During QA testing of the patch, we found an issue, so the fix most likely
> won't make it in for the 23.02.1 release but should make 23.02.2 release.

This has now been fixed for the upcoming Slurm-23.02.2 release:
> https://github.com/SchedMD/slurm/compare/b70c43caa7...581fa24d6d