| Summary: | slurmrestd crashes and uses 100% of CPU | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmrestd | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nate |
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=19268 | ||
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02.2, 23.11.0rc1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | test patch for GSK only, core.2123 archive, the gdb output 20230315, the gdb output 20230315-v1, the gdb output 20230316 | ||
Description
GSK-ONYX-SLURM
2022-06-30 04:22:05 MDT
(In reply to GSK-EIS-SLURM from comment #0)
> [Service]
> Restart=on-failure
> RestartSec=5s

We generally don't suggest this, as it could lead to lost updates. In this case, it appears to be an effective enough workaround.

> 2022-03-11T03:32:22.510925-05:00 us1sxlx00338 slurmrestd[2464]: error: _handle_poll_event: [[us4ndvs007.corpnet1.com]:46590] poll error: Unexpected missing socket error

This is a fail-safe error where the socket disappeared, lost its state, or had its error overwritten.

> From time to time we receive the alert about using 100% of CPU by the slurmrestd. The API is no longer working at that time. Restarting the daemon helps.

Have any cores been taken while it's in this state? Please attach the full slurmrestd logs if possible. Please also check dmesg for errors around these times.

Created attachment 25716 [details]
test patch for GSK only

(In reply to Nate Rini from comment #1)
> This is a fail-safe error where the socket disappeared, lost its state, or had its error overwritten.

Please also note that this error has been improved by bug#13050 in the slurm-22.05 release. I have attached a ported copy of the patch to apply to your system to see if we can get a better error. Please note that we generally only apply improvements to the current master branch and not tagged releases. Your site is also welcome to upgrade to 22.05 to get this patch.

(In reply to Nate Rini from comment #1)
> Have any cores been taken while it's in this state? Please attach the full slurmrestd logs if possible. Please also check dmesg for errors around these times.

No, nothing has changed since the VM was provisioned. I took a look at the logs, but unfortunately they have been rotated, so I cannot paste anything else. The issue happens occasionally; the last time was at the beginning of June, if I'm not mistaken. It is not related to one cluster only: most of our clusters were affected by this error, even the dev/test environments, where it's quiet and no one uses them. There was no issue under the 20.x version, where it was possible to run the daemon as the root user. I don't know if that has anything to do with it, but I think it's worth mentioning. At least, I can't remember the issue being reported at that time.

(In reply to Nate Rini from comment #3)
> Please also note that this error has been improved by bug#13050 in the slurm-22.05 release.

The workaround I mentioned in comment #0 has been applied across all the clusters, so if it works and the service is restarted every time the issue happens, we won't see it anymore. I guess it will still be visible in the logs, so reviewing them from time to time may help to find something.

> I have attached a ported copy of the patch to apply to your system to see if we can get a better error. Your site is also welcome to upgrade to 22.05 to get this patch.

Thanks a lot. I plan to start upgrading clusters at the beginning of August. I haven't updated the service files on our sandboxes yet, so I will keep an eye on those clusters to see if I can catch the improved error. Thanks Nate!

Hi Nate, let's close the ticket. Once Slurm is upgraded to the 22.05.x version I will come back to this and re-open the ticket if needed.
Thanks, Radek

(In reply to GSK-EIS-SLURM from comment #6)
> Hi Nate, let's close the ticket. Once Slurm is upgraded to the 22.05.x version I will come back to this and re-open the ticket if needed.

Closing ticket per last response. Please re-open if the issue persists after the upgrade.

Dear SchedMD Team,

I am going to reopen this ticket, as the issue still persists. I tried to capture as many details as I could, but to be honest the error has not improved since last time; I don't see any more detail in the logs to be captured. slurmrestd runs as a systemd service under the slurmrestapi user. The systemd config file has not changed since last time and can be seen in comment #0. The only thing that has changed is the ExecStart= path, as we upgraded Slurm to version 22.05.2. It didn't help. Below is the output taken from top:

Tasks: 236 total,   2 running, 234 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.4 us, 63.8 sy,  0.0 ni,  2.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3861252 total,   177008 free,   570128 used,  3114116 buff/cache
KiB Swap:  8388604 total,  8370932 free,    17672 used.  2808692 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
78725 slurmre+  20   0 1094764   7756   2680 S 103.3  0.2   1477:21 slurmrestd
  517 root      20   0   63592  24108  23944 R  89.7  0.6   3087:07 systemd-journal
 1867 root      20   0  214104   5848   3544 S   0.3  0.2  67:21.43 vmtoolsd
18384 root      20   0  439592 124360   6716 S   0.3  3.2 291:02.74 splunkd
20850 root      20   0       0      0      0 S   0.3  0.0   0:00.46 kworker/0:3
26038 root      20   0  168564   2596   1712 R   0.3  0.1   0:00.02 top
    1 root      20   0  125884   3912   2308 S   0.0  0.1  33:49.49 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:02.04 kthreadd
    4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S   0.0  0.0   5:37.99 ksoftirqd/0

/var/log/messages — the lines keep repeating:

2023-01-22T04:19:11.667795-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
2023-01-22T04:19:11.668044-05:00 us1sxlx00179 slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
[...]

The systemd status:

[root@us1sxlx00179 ~]# systemctl -l status slurmrestd.service
● slurmrestd.service - Slurm REST daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmrestd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-22 14:08:35 EST; 1 months 1 days ago
 Main PID: 78725 (slurmrestd)
   CGroup: /system.slice/slurmrestd.service
           └─78725 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf

Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor
[...]
[root@us1sxlx00179 ~]#

I'm not allowed to see the https://bugs.schedmd.com/show_bug.cgi?id=13050 bug. Any thoughts?
Thanks, Radek

Reopening the ticket.

Hi Nate,
Have you had a chance to look at the ticket and the update I added?
Thanks, Radek

(In reply to GSK-EIS-SLURM from comment #10)
> Have you had a chance to look at the ticket and the update I added?

Looking at it now.

(In reply to GSK-EIS-SLURM from comment #8)
> I'm not allowed to see the https://bugs.schedmd.com/show_bug.cgi?id=13050 bug.

This bug resulted in these changes being pushed upstream:

> * dddeb9f0ff (HEAD -> master, origin/master, origin/HEAD) Merge branch 'bug13050'
> |\
> | * da1147c05b (bug13050) Check return code of fd_get_socket_error() and don't use errno
> | * c688b2a1c2 check return code of fd_get_socket_error
> | * fa8f37147d We shouldn't pass &errno to getsockopt
> | * 35c9236d50 fd_get_socket_error - use SLURM_COMMUNICATIONS_MISSING_SOCKET_ERROR
> |/

Looks like this is an unrelated issue.

(In reply to GSK-EIS-SLURM from comment #8)
> Jan 23 01:58:42 us1sxlx00179.corpnet2.com slurmrestd[78725]: error: _handle_poll_event: [[us1ndvs21.corpnet1.com]:46232] poll error: fd_get_socket_error failed Bad file descriptor

This error should have caused the connection to be closed, yet `[[us1ndvs21.corpnet1.com]:46232]` continues dumping errors. Please activate slurmrestd with the following environment variables:

> SLURM_DEBUG_FLAGS=net
> SLURMRESTD_DEBUG=4

It will dump a lot of logs. Please just run it until the `poll error` errors start and then stop it. Please attach the logs and revert the env change for normal production.

I assume slurmrestd is able to handle normal communications even though it is using 100% of a CPU?
(In reply to Nate Rini from comment #14)
> Please activate slurmrestd with the following environment variables:
> > SLURM_DEBUG_FLAGS=net
> > SLURMRESTD_DEBUG=4
>
> It will dump a lot of logs. Please just run it until the `poll error` errors start and then stop it. Please attach the logs and revert the env change for normal production.

Sure, will do that, but it may take some time until the issue appears again.

> I assume slurmrestd is able to handle normal communications even though it is using 100% of a CPU?

HPC users haven't reported anything related to slurmrestd while it's using 100% of CPU, but I will check that and confirm.

I cannot tell you when it will happen again, so perhaps you'd like to close the ticket and I will reopen it if needed. What's your take on that?
Cheers, Radek

(In reply to GSK-EIS-SLURM from comment #15)
> HPC users haven't reported anything related to slurmrestd while it's using 100% of CPU, but I will check that and confirm.

Good, then the design of slurmrestd is still working as intended.

> I cannot tell you when it will happen again, so perhaps you'd like to close the ticket and I will reopen it if needed. What's your take on that?

I'll tag the ticket as timed out. The first reply will automatically re-open it and we can analyze the logs then.

(In reply to Nate Rini from comment #14)
> It will dump a lot of logs. Please just run it until the `poll error` errors start and then stop it. Please attach the logs and revert the env change for normal production.

The issue happened again.
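For reference, one way to set the requested debug variables for a systemd-managed slurmrestd is a drop-in override; this is a sketch only — the drop-in path is hypothetical and the unit name is assumed from the status output earlier in this ticket:

```ini
# /etc/systemd/system/slurmrestd.service.d/debug.conf (hypothetical path)
# Apply with `systemctl daemon-reload && systemctl restart slurmrestd`;
# delete this file and reload again to revert for normal production.
[Service]
Environment=SLURM_DEBUG_FLAGS=net
Environment=SLURMRESTD_DEBUG=4
```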
This is what I was able to capture once I had added the flags to the env file:

2023-02-07T20:52:22.688441-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43348] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:22.688749-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43348] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:22.689015-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43348] on_data returned rc: Operation not permitted
2023-02-07T20:52:22.689279-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43348] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.710686-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43734] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.710950-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43734] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.711197-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43734] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.711431-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43734] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.728043-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43742] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.728301-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43742] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.728520-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43742] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.728740-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43742] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.751721-05:00 us1sxlx00179 slurmrestd[112047]: error: parse_http: [[us4ndvs013.corpnet1.com]:43748] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.751953-05:00 us1sxlx00179 slurmrestd: slurmrestd: error: parse_http: [[us4ndvs013.corpnet1.com]:43748] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
2023-02-07T20:52:31.752222-05:00 us1sxlx00179 slurmrestd[112047]: error: _wrap_on_data: [[us4ndvs013.corpnet1.com]:43748] on_data returned rc: Operation not permitted
2023-02-07T20:52:31.753232-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
2023-02-07T20:52:31.753460-05:00 us1sxlx00179 slurmrestd[112047]: error: _handle_poll_event: [[us4ndvs013.corpnet1.com]:43748] poll error: fd_get_socket_error failed Bad file descriptor
[...]

> I assume slurmrestd is able to handle normal communications even though it is using 100% of a CPU?

Yes, I did check, and I was able to submit a job using the Slurm API.
Cheers, Radek

(In reply to GSK-EIS-SLURM from comment #17)
> Yes, I did check, and I was able to submit a job using the Slurm API.

Was a core taken (using gcore) while it was in this state?

(In reply to Nate Rini from comment #18)
> Was a core taken (using gcore) while it was in this state?

Nope, I wasn't asked to do that. Shall I do it next time the issue occurs?

(In reply to GSK-EIS-SLURM from comment #19)
> Nope, I wasn't asked to do that. Shall I do it next time the issue occurs?

Yes, please grab a core while it is having the issue. The logs presented so far don't provide enough information to determine the cause. I'm going to mark this ticket as timed out. Once there is a core, please respond to this ticket and we can debug the issue with the core. Until then, there isn't much that can be done on our side.

Hi Nate / SchedMD Team,
I was able to capture the gcore - see a file attached.
[root@uk1sxlx00213 ~]# systemctl -l status slurmrestd.service
● slurmrestd.service - Slurm REST daemon
Loaded: loaded (/usr/lib/systemd/system/slurmrestd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2023-02-26 17:12:22 GMT; 2 weeks 1 days ago
Main PID: 2123 (slurmrestd)
CGroup: /system.slice/slurmrestd.service
└─2123 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
Mar 14 08:27:42 uk1sxlx00213.corpnet2.com slurmrestd[2123]: error: _handle_poll_event: [[uk2ndvs009.corpnet1.com]:41268] poll error: fd_get_socket_error failed Bad file descriptor
[...]
[root@uk1sxlx00213 ~]#
The uk2ndvs009 device that keeps appearing in the logs is a vulnerability scanner; the Tenable software is installed on it.
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]# ps -ef | grep slurmrestd
slurmre+ 2123 1 5 Feb26 ? 22:28:46 /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd 0.0.0.0:8080 -f /etc/slurm/slurm-token.conf
root 64588 64206 0 08:28 pts/0 00:00:00 grep slurmrestd
[root@uk1sxlx00213 ~]#
[root@uk1sxlx00213 ~]# gcore 2123
[New LWP 2184]
[New LWP 2183]
[New LWP 2182]
[New LWP 2181]
[New LWP 2180]
[New LWP 2179]
[New LWP 2178]
[New LWP 2177]
[New LWP 2176]
[New LWP 2175]
[New LWP 2174]
[New LWP 2173]
[New LWP 2172]
[New LWP 2171]
[New LWP 2170]
[New LWP 2169]
[New LWP 2168]
[New LWP 2167]
[New LWP 2166]
[New LWP 2165]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
warning: target file /proc/2123/cmdline contained unexpected null characters
Saved corefile core.2123
[Inferior 1 (process 2123) detached]
[root@uk1sxlx00213 ~]#
Let me know if there's anything else you need.
Thanks,
Radek
Created attachment 29311 [details]
core.2123 archive
(In reply to GSK-EIS-SLURM from comment #23)
> Created attachment 29311 [details]
> core.2123 archive

Please keep the core locally. It is only helpful with all of the loaded libraries from the time of dumping.

(In reply to GSK-EIS-SLURM from comment #22)
> Let me know if there's anything else you need.

Please call this using gdb:

> gdb $(which slurmrestd) core.2123
>> set pagination off
>> set print pretty on
>> t a a bt full

Once I have that, I will likely need the output of a few more gdb commands.

(In reply to Nate Rini from comment #24)
> Please keep the core locally. It is only helpful with all of the loaded libraries from the time of dumping.

My apologies, I've never used gcore before...

(In reply to Nate Rini from comment #25)
> Please call this using gdb
> > gdb $(which slurmrestd) core.2123

The txt file is attached. I'm not sure if I executed everything as you requested; please let me know if not. Once we have the core dump output, can I restart the slurmrestd service so it stops consuming 100% of CPU, or might something else still be needed?
Radek

Created attachment 29330 [details]
the gdb output 20230315
Looks like slurmrestd is not in your user's path:

> [root@uk1sxlx00213 tmp]# gdb $(which slurmrestd) core.2123
> which: no slurmrestd in (/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin)

Can you please redo the procedure, but pass the path to slurmrestd as the second arg?

> # gdb $PATH_TO_SLURMRESTD core.2123

Other than that, everything else looked correct.

(In reply to Nate Rini from comment #28)
> Can you please redo the procedure, but pass the path to slurmrestd as the second arg?

Oops, I overlooked this, apologies again. File attached.
Thanks, Radek

Created attachment 29340 [details]
the gdb output 20230315-v1
Please call this in gdb same as before:
> set pagination off
> set print pretty on
> t 21
> f 1
> p *mgr
Created attachment 29350 [details]
the gdb output 20230316
Please call this in gdb same as before:
> set pagination off
> set print pretty on
> t 21
> f 1
> p *(con_mgr_fd_t *) mgr->connections->head->data
(In reply to Nate Rini from comment #36)
> Please call this in gdb same as before:
> > set pagination off
> > set print pretty on
> > t 21
> > f 1
> > p *(con_mgr_fd_t *) mgr->connections->head->data

Here you are:

[root@uk1sxlx00213 tmp]# gdb /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd core.2123
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd...done.
[New LWP 2165]
[New LWP 2166]
[New LWP 2167]
[New LWP 2168]
[New LWP 2169]
[New LWP 2170]
[New LWP 2171]
[New LWP 2172]
[New LWP 2173]
[New LWP 2174]
[New LWP 2175]
[New LWP 2176]
[New LWP 2177]
[New LWP 2178]
[New LWP 2179]
[New LWP 2180]
[New LWP 2181]
[New LWP 2182]
[New LWP 2183]
[New LWP 2184]
[New LWP 2123]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/slurm/Software/RHEL7/slurm/22.05.2/sbin/slurmrestd'.
#0  0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64 http-parser-2.7.1-9.el7.x86_64 json-c-0.11-4.el7_0.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-55.el7_9.x86_64 libcom_err-1.42.9-19.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-25.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-20.el7_9.x86_64
(gdb) set pagination off
(gdb) set print pretty on
(gdb) t 21
[Switching to thread 21 (Thread 0x7f945d1e3740 (LWP 2123))]
#0  0x00007f945c5e8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) f 1
#1  0x00007f945cc85646 in _watch (mgr=0x21eca10) at conmgr.c:1624
1624	conmgr.c: No such file or directory.
(gdb) p *(con_mgr_fd_t *) mgr->connections->head->data
$1 = {
  magic = -768326417,
  input_fd = -1,
  output_fd = 8,
  arg = 0x7f94380030e0,
  name = 0x7f9438002b70 "[uk2ndvs009.corpnet1.com]:41268",
  events = {
    on_connection = 0x405822 <_setup_http_context>,
    on_data = 0x404374 <parse_http>,
    on_finish = 0x40484c <on_http_connection_finish>
  },
  in = 0x7f9438000950,
  on_data_tried = false,
  out = 0x7f9438002b30,
  is_socket = true,
  unix_socket = 0x0,
  is_listen = false,
  can_write = false,
  can_read = false,
  read_eof = true,
  is_connected = true,
  has_work = false,
  work = 0x7f9438000a70,
  mgr = 0x21eca10
}
(gdb) quit
[root@uk1sxlx00213 tmp]#

That provided the information I needed. Working on a corrective patch.

(In reply to Nate Rini from comment #39)
> That provided the information I needed. Working on a corrective patch.

Hi Nate, when do you think the patch will be ready? Will it also be applied to the latest Slurm, for instance 23.02.1?
Thanks, Radek

(In reply to GSK-EIS-SLURM from comment #46)
> Hi Nate, when do you think the patch will be ready? Will it also be applied to the latest Slurm, for instance 23.02.1?

During QA testing of the patch we found an issue, so the fix most likely won't make it into the 23.02.1 release, but it should make the 23.02.2 release.

(In reply to Nate Rini from comment #48)
> During QA testing of the patch we found an issue, so the fix most likely won't make it into the 23.02.1 release, but it should make the 23.02.2 release.

This has now been fixed for the upcoming Slurm-23.02.2 release:
> https://github.com/SchedMD/slurm/compare/b70c43caa7...581fa24d6d