| Summary: | Error messages in the logs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Marco Induni <marco.induni> |
| Component: | slurmrestd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek, nate |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CSCS - Swiss National Supercomputing Centre | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
Strace dump
Network log during an API request |
||
|
Description
Marco Induni
2021-04-20 03:23:51 MDT
(In reply to Marco Induni from comment #0) > Apr 20 11:17:17 fojorina05 slurmrestd: slurmrestd: error: > _handle_poll_event: [[fojorina04.cscs.ch]:44254] poll error: Unspecified > error > Apr 20 11:17:17 fojorina05 slurmrestd[16223]: error: _handle_poll_event: > [[fojorina04.cscs.ch]:44254] poll error: Unspecified error The incorrect error is getting dumped here. This has already been fixed in bug#11098 and will be included in 20.11.6. > Is something should I care or I can ignore it ? Generally, if poll() is returning an error then something happened to the connection outside of Slurm's control. Since the patchset for bug#11098 is a little large, I rather not suggest just patching it in. Is it possible to run slurmrestd with strace instead and provide that output? Created attachment 19048 [details]
Strace dump
Dear Nate,
attached the strace dump of the running process while I did the query and get the error.
Apparently seems that this error is thrown out when the query is executed not directly to the node, but trough a proxy server.
Could that be an explanation (I'm still testing if this is always the case)
Kind regards,
Marco
(In reply to Marco Induni from comment #2) > Created attachment 19048 [details] > Strace dump > 16236 <... poll resumed>) = 1 ([{fd=8, revents=POLLIN|POLLERR|POLLHUP}]) The strace log shows that the connection was closed (probably reset) with incoming data. > Apparently seems that this error is thrown out when the query is executed > not directly to the node, but trough a proxy server. > > Could that be an explanation (I'm still testing if this is always the case) It's possible the proxy errored out and closed the connection early. Was there an error in the proxy's logs? Hi,
> It's possible the proxy errored out and closed the connection early. Was
> there an error in the proxy's logs?
no error found in the log
Marco
Marco, Are we talking about a testing/non-production environment? If yes please enable DebugFlags=Network and share the logs from the time you reproduce the issue. What is the proxy server? How long does it take to get the result of the query when you use curl directly against slurmrestd (this works fine?)? cheers, Marcin (In reply to Marcin Stolarek from comment #7) Hi Marcin, > > enable DebugFlags=Network and share the logs from the time you reproduce the > issue. > Attached you will find the log > What is the proxy server? How long does it take to get the result of the > query when you use curl directly against slurmrestd (this works fine?)? > The time doesn't change much, and we use haproxy 1.8.26 Thank you Marco Created attachment 19086 [details]
Network log during an API request
>> What is the proxy server? How long does it take to get the result of the
>> query when you use curl directly against slurmrestd (this works fine?)?
>>
>The time doesn't change much, and we use haproxy 1.8.26
Thanks, but could you please answer the initial question - "How long does it take to get the result of the query when you use curl directly against slurmrestd", for instance, wrapping curl execution by the `time` command?
cheers,
Marcin
Marco, Could you please take a look at comment 11. The debug logs you shared show that we got a combination of POLLIN|POLLHUP|POLLERR flags, based on glibc documentation of POLLHUP: > POLLHUP > Hang up (only returned in revents; ignored in events). Note that >when reading from a channel such as a pipe or a stream socket, this event merely >indicates that the peer closed its end of the channel. Sub‐ >sequent reads from the channel will return 0 (end of file) only after all >outstanding data in the channel has been consumed. which suggests that connection was closed by (timed-out - that's why I'm intereted in time needed to get the reply directly from slurmrestd) from the HTTP proxy side. cheers, Marcin Dear Marcin, here the timings as requeted: via proxy time curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://slurmdb.cscs.ch:6822/slurm/v0.0.36/jobs' real 0m10,198s user 0m0,012s sys 0m1,742s via slurmrestd time curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://fojorina04.cscs.ch:6821/slurm/v0.0.36/jobs' real 0m10,528s user 0m0,000s sys 0m1,725s Thank you Kind regards, Marco (In reply to Marcin Stolarek from comment #12) > Marco, > > Could you please take a look at comment 11. > > The debug logs you shared show that we got a combination of > POLLIN|POLLHUP|POLLERR flags, based on glibc documentation of POLLHUP: > > > POLLHUP > > Hang up (only returned in revents; ignored in events). Note that > >when reading from a channel such as a pipe or a stream socket, this event merely > >indicates that the peer closed its end of the channel. Sub‐ > >sequent reads from the channel will return 0 (end of file) only after all > >outstanding data in the channel has been consumed. > > which suggests that connection was closed by (timed-out - that's why I'm > intereted in time needed to get the reply directly from slurmrestd) from the > HTTP proxy side. > > cheers, > Marcin What is the value of `timeout server` in your haproxy configuration? Looking at the code the only reason I can find is that second end(haproxy) closed the connection. cheers, Marcin Marco, Did you have time to look at comment 14? Do you think that it's possible that haporxy closed the connection - the times you shared in comment 13 are close to 10s, which made me think that maybe there is a 10s timeout for haproxy getting a reply from slurmrestd? cheers, Marcin (In reply to Marcin Stolarek from comment #15) > Marco, > > Did you have time to look at comment 14? Do you think that it's possible > that haporxy closed the connection - the times you shared in comment 13 are > close to 10s, which made me think that maybe there is a 10s timeout for > haproxy getting a reply from slurmrestd? > > cheers, > Marcin Dear Marcin, timeout server is set to 1 minute. The only 10s timeouts are defined in the default section and are these: timeout connect 10s timeout check 10s I will try to verify if updating one of this will change the behavior. Kind regards, Marco Marco, >I will try to verify if updating one of this will change the behavior. Where you able to verify? Were you able to upgrade and verify if the issue still exists on the newer(comment 1) version? cheers, Marcin Marcin > > Were you able to upgrade and verify if the issue still exists on the > newer(comment 1) version? > yes I set all to 30s, but the results are exactly the same. When the answer goes trough the haproxy I get this error. Are there other sites/users that have an proxy in front of the slurmrestd ? The setup via proxy should be possible (https://slurm.schedmd.com/SLUG20/REST_API.pdf) but maybe we did something wrong. Thank you Marco >Are there other sites/users that have an proxy in front of the slurmrestd ? I assume yes, but it's not something we're normally aware of. You're correct setup with a front-end HTTP server is recommended. I think I was able to reproduce the issue in the lab environment. Just to double check all the results returned look fine, but the query coming via haproxy ends-up-with error message (similar to yours, but a little bit different): urmrestd: >error: _handle_poll_event: [[localhost]:41622] poll error: Connection reset by peer cheers, Marcin Marco, Looking into the details I've understood that it's haproxy specific. (You may want to take a look at stackoverflow[1] for other backends reporting similar error). The way to fix it is to use `option httpclose` in haproxy configuration, however, depending on haproxy version you may have to remove `http-pretend-keepalived` option since they were in conflict. At least it's what worked for me on haproxy-1.5 (CentOS7). However, I can't further comment on the impact of those on haproxy behavior (it's not our/SchedMD area of expertise). Let me know how that worked for you. cheers, Marcin [1]https://stackoverflow.com/questions/21550337/haproxy-netty-way-to-prevent-exceptions-on-connection-reset Marco, Could you please take a look at my last comment? What are your thoughts? cheers, Marcin (In reply to Marcin Stolarek from comment #22) > Marco, > > Could you please take a look at my last comment? What are your thoughts? > > cheers, > Marcin Dear Marcin, than you for the hint. After testing some options, I found that the one fixing my issue is option http-server-close I think you can close the ticket and thank you for great support. Kind regards, Marco |