Ticket 11415

Summary:	Error messages in the logs
Product:	Slurm	Reporter:	Marco Induni <marco.induni>
Component:	slurmrestd	Assignee:	Marcin Stolarek <cinek>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	cinek, nate
Version:	20.11.4
Hardware:	Linux
OS:	Linux
Site:	CSCS - Swiss National Supercomputing Centre	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	Strace dump Network log during an API request

Description Marco Induni 2021-04-20 03:23:51 MDT

Dear Support,
I've noticed that every time I do a query like this one 

curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://slurmdb.cscs.ch:6822/slurm/v0.0.36/jobs'

I've got the results, but also this message in the log file

Apr 20 11:17:17 fojorina05 slurmrestd: slurmrestd: error: _handle_poll_event: [[fojorina04.cscs.ch]:44254] poll error: Unspecified error
Apr 20 11:17:17 fojorina05 slurmrestd[16223]: error: _handle_poll_event: [[fojorina04.cscs.ch]:44254] poll error: Unspecified error

Is something should I care or I can ignore it ?

Thank you for any advice.

Best regards,
Marco

Comment 1 Nate Rini 2021-04-20 08:34:45 MDT

(In reply to Marco Induni from comment #0)
> Apr 20 11:17:17 fojorina05 slurmrestd: slurmrestd: error:
> _handle_poll_event: [[fojorina04.cscs.ch]:44254] poll error: Unspecified
> error
> Apr 20 11:17:17 fojorina05 slurmrestd[16223]: error: _handle_poll_event:
> [[fojorina04.cscs.ch]:44254] poll error: Unspecified error
The incorrect error is getting dumped here. This has already been fixed in bug#11098 and will be included in 20.11.6.

> Is something should I care or I can ignore it ?
Generally, if poll() is returning an error then something happened to the connection outside of Slurm's control.

Since the patchset for bug#11098 is a little large, I rather not suggest just patching it in. Is it possible to run slurmrestd with strace instead and provide that output?

Comment 2 Marco Induni 2021-04-20 09:01:51 MDT

Created attachment 19048 [details]
Strace dump

Dear Nate,
attached the strace dump of the running process while I did the query and get the error.

Apparently seems that this error is thrown out when the query is executed not directly to the node, but trough a proxy server.

Could that be an explanation (I'm still testing if this is always the case)

Kind regards,
Marco

Comment 3 Nate Rini 2021-04-20 15:31:00 MDT

(In reply to Marco Induni from comment #2)
> Created attachment 19048 [details]
> Strace dump
> 16236 <... poll resumed>)               = 1 ([{fd=8, revents=POLLIN|POLLERR|POLLHUP}])

The strace log shows that the connection was closed (probably reset) with incoming data.
 
> Apparently seems that this error is thrown out when the query is executed
> not directly to the node, but trough a proxy server.
> 
> Could that be an explanation (I'm still testing if this is always the case)
It's possible the proxy errored out and closed the connection early. Was there an error in the proxy's logs?

Comment 4 Marco Induni 2021-04-21 02:24:33 MDT

Hi,
> It's possible the proxy errored out and closed the connection early. Was
> there an error in the proxy's logs?
no error found in the log

Marco

Comment 7 Marcin Stolarek 2021-04-22 01:21:25 MDT

Marco,

Are we talking about a testing/non-production environment? If yes please enable DebugFlags=Network and share the logs from the time you reproduce the issue.

What is the proxy server? How long does it take to get the result of the query when you use curl directly against slurmrestd (this works fine?)?

cheers,
Marcin

Comment 8 Marco Induni 2021-04-22 08:59:03 MDT

(In reply to Marcin Stolarek from comment #7)
Hi Marcin,
> 
> enable DebugFlags=Network and share the logs from the time you reproduce the
> issue.
> 
Attached you will find the log

> What is the proxy server? How long does it take to get the result of the
> query when you use curl directly against slurmrestd (this works fine?)?
> 
The time doesn't change much, and we use haproxy 1.8.26

Thank you
Marco

Comment 9 Marco Induni 2021-04-22 09:00:23 MDT

Created attachment 19086 [details]
Network log  during an API request

Comment 11 Marcin Stolarek 2021-04-26 00:34:49 MDT

>> What is the proxy server? How long does it take to get the result of the
>> query when you use curl directly against slurmrestd (this works fine?)?
>> 
>The time doesn't change much, and we use haproxy 1.8.26

Thanks, but could you please answer the initial question - "How long does it take to get the result of the query when you use curl directly against slurmrestd", for instance, wrapping curl execution by the `time` command?

cheers,
Marcin

Comment 12 Marcin Stolarek 2021-05-04 05:29:48 MDT

Marco,

 Could you please take a look at comment 11.

 The debug logs you shared show that we got a combination of POLLIN|POLLHUP|POLLERR  flags, based on glibc documentation of POLLHUP:

> POLLHUP
>              Hang up (only returned in revents; ignored in events).  Note that 
>when reading from a channel such as a pipe or a stream socket, this event merely 
>indicates that the peer closed its end of the channel.  Sub‐
>sequent reads from the channel will return 0 (end of file) only after all 
>outstanding data in the channel has been consumed.

which suggests that connection was closed by (timed-out - that's why I'm intereted in time needed to get the reply directly from slurmrestd) from the HTTP proxy side.

cheers,
Marcin

Comment 13 Marco Induni 2021-05-04 06:28:02 MDT

Dear Marcin,
here the timings as requeted:

via proxy 
time curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://slurmdb.cscs.ch:6822/slurm/v0.0.36/jobs'

real	0m10,198s
user	0m0,012s
sys	0m1,742s

via slurmrestd
time curl -s -H X-SLURM-USER-NAME:minduni -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET 'http://fojorina04.cscs.ch:6821/slurm/v0.0.36/jobs'

real	0m10,528s
user	0m0,000s
sys	0m1,725s


Thank you

Kind regards,
Marco


(In reply to Marcin Stolarek from comment #12)
> Marco,
> 
>  Could you please take a look at comment 11.
> 
>  The debug logs you shared show that we got a combination of
> POLLIN|POLLHUP|POLLERR  flags, based on glibc documentation of POLLHUP:
> 
> > POLLHUP
> >              Hang up (only returned in revents; ignored in events).  Note that 
> >when reading from a channel such as a pipe or a stream socket, this event merely 
> >indicates that the peer closed its end of the channel.  Sub‐
> >sequent reads from the channel will return 0 (end of file) only after all 
> >outstanding data in the channel has been consumed.
> 
> which suggests that connection was closed by (timed-out - that's why I'm
> intereted in time needed to get the reply directly from slurmrestd) from the
> HTTP proxy side.
> 
> cheers,
> Marcin

Comment 14 Marcin Stolarek 2021-05-13 04:03:42 MDT

What is the value of `timeout server` in your haproxy configuration?

Looking at the code the only reason I can find is that second end(haproxy) closed the connection.

cheers,
Marcin

Comment 15 Marcin Stolarek 2021-05-24 03:36:40 MDT

Marco,

Did you have time to look at comment 14? Do you think that it's possible that haporxy closed the connection - the times you shared in comment 13 are close to 10s, which made me think that maybe there is a 10s timeout for haproxy getting a reply from slurmrestd?

cheers,
Marcin

Comment 16 Marco Induni 2021-05-25 00:22:52 MDT

(In reply to Marcin Stolarek from comment #15)
> Marco,
> 
> Did you have time to look at comment 14? Do you think that it's possible
> that haporxy closed the connection - the times you shared in comment 13 are
> close to 10s, which made me think that maybe there is a 10s timeout for
> haproxy getting a reply from slurmrestd?
> 
> cheers,
> Marcin

Dear Marcin,
timeout server is set to 1 minute. 
The only 10s timeouts are defined in the default section and are these:

timeout  connect 10s
timeout  check 10s

I will try to verify if updating one of this will change the behavior.

Kind regards,
Marco

Comment 17 Marcin Stolarek 2021-06-08 05:04:06 MDT

Marco,

>I will try to verify if updating one of this will change the behavior.
Where you able to verify?

Were you able to upgrade and verify if the issue still exists on the newer(comment 1) version?

cheers,
Marcin

Comment 18 Marco Induni 2021-06-11 06:14:48 MDT

Marcin
> 
> Were you able to upgrade and verify if the issue still exists on the
> newer(comment 1) version?
> 
yes I set all to 30s, but the results are exactly the same.

When the answer goes trough the haproxy I get this error.

Are there other sites/users that have an proxy in front of the slurmrestd ? 
The setup via proxy should be possible (https://slurm.schedmd.com/SLUG20/REST_API.pdf) but maybe we did something wrong.


Thank you
Marco

Comment 19 Marcin Stolarek 2021-06-15 07:40:13 MDT

>Are there other sites/users that have an proxy in front of the slurmrestd ?
I assume yes, but it's not something we're normally aware of. You're correct setup with a front-end HTTP server is recommended.

I think I was able to reproduce the issue in the lab environment.

Just to double check all the results returned look fine, but the query coming via haproxy ends-up-with error message (similar to yours, but a little bit different):
urmrestd:
>error: _handle_poll_event: [[localhost]:41622] poll error: Connection reset by peer


cheers,
Marcin

Comment 21 Marcin Stolarek 2021-06-23 01:23:28 MDT

Marco,

Looking into the details I've understood that it's haproxy specific. (You may want to take a look at stackoverflow[1] for other backends reporting similar error).

The way to fix it is to use `option httpclose` in haproxy configuration, however, depending on haproxy version you may have to remove `http-pretend-keepalived` option since they were in conflict.
At least it's what worked for me on haproxy-1.5 (CentOS7). However, I can't further comment on the impact of those on haproxy behavior (it's not our/SchedMD area of expertise). 

Let me know how that worked for you.

cheers,
Marcin

[1]https://stackoverflow.com/questions/21550337/haproxy-netty-way-to-prevent-exceptions-on-connection-reset

Comment 22 Marcin Stolarek 2021-07-02 05:33:20 MDT

 Marco,

Could you please take a look at my last comment? What are your thoughts?

cheers,
Marcin

Comment 23 Marco Induni 2021-07-07 02:18:16 MDT

(In reply to Marcin Stolarek from comment #22)
>  Marco,
> 
> Could you please take a look at my last comment? What are your thoughts?
> 
> cheers,
> Marcin

Dear Marcin,
than you for the hint. 
After testing some options, I found that the one fixing my issue is 
option http-server-close 

I think you can close the ticket and thank you for great support.

Kind regards,
Marco