Ticket 4793

Summary: Default SyscfgTimeout too low for Dell nodes
Product: Slurm
Reporter: Christopher Samuel <chris>
Component: KNL
Assignee: Director of Support <support>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: Swinburne
Version Fixed: 17.11
Attachments: knl_man_page.patch
knl_web_page.patch

Description Christopher Samuel 2018-02-14 19:32:37 MST
Hi there,

On our Dell KNL boxes (PowerEdge C6320p) we find the default timeout for syscfg is way too low: it takes almost 3 seconds (measured with "time") for each run of the syscfg command to find the information it needs.

We've had to configure:

SyscfgTimeout=4000

but it might be handy to adjust that timeout up when SystemType=Dell is specified.
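
For anyone wanting to repeat the measurement on their own nodes, here is a rough sketch of how the timing could be scripted rather than done by hand with "time". It assumes SyscfgTimeout is in milliseconds (4000 here matches the 4 second limit mentioned later in this ticket). The function names, the number of runs, and the 3x headroom factor are all illustrative choices, not anything Slurm provides, and the command in the example is a placeholder to be swapped for the real syscfg invocation on the node.

```python
import math
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def suggest_timeout_ms(durations, headroom=3.0):
    """Scale the slowest run by `headroom` (an assumed safety factor),
    round up to a whole second, and return the result in milliseconds."""
    if not durations:
        raise ValueError("need at least one measurement")
    worst = max(durations)
    return math.ceil(worst * headroom) * 1000


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; replace it with
    # the actual syscfg command line used on the KNL node.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    samples = time_command(placeholder_cmd)
    print(f"slowest run: {max(samples):.2f}s")
    print(f"SyscfgTimeout={suggest_timeout_ms(samples)}")
```

Basing the suggestion on the slowest run rather than the average is deliberate, since an occasional slow run is enough to trip the timeout.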

Hope this helps!

All the best,
Chris
Comment 2 Isaac Hartung 2018-02-15 12:08:00 MST
Hi Chris,

There isn't a suitable mechanism in place to base defaults, such as syscfgtimeout, on type.

However we will update the documentation of the current version to advise on this issue.

Regards
Comment 3 Christopher Samuel 2018-02-15 14:56:01 MST
Hi Isaac,

On 16/02/18 06:08, bugs@schedmd.com wrote:

> There isn't a suitable mechanism in place to base defaults, such as 
> syscfgtimeout, on type.
> 
> However we will update the documentation of the current version to
> advise on this issue.

That's all good, thanks for letting me know!

All the best,
Chris
Comment 8 Isaac Hartung 2018-02-21 09:05:53 MST
Christopher,

We have updated the knl.conf manpage in commit
c543cec33a1565ea5261ed3adb82ccc621905c43.  I am closing this ticket, but should you have any further, related issues, please reopen it.

Regards
Comment 9 Christopher Samuel 2018-02-21 15:34:39 MST
Hi Isaac,

On 22/02/18 03:05, bugs@schedmd.com wrote:

> We have updated the knl.conf manpage in commit 
> c543cec33a1565ea5261ed3adb82ccc621905c43.  I am closing this ticket,
> but should you have any further, related issues, please reopen it.

I meant to update this ticket yesterday I'm afraid, but didn't
have time in the end. I wanted to let you know we were still
seeing occasional timeouts with a 4 second limit so I had to
increase it to 10 seconds to try and make it more resilient.

I've just done a pull request against the 17.11 branch for the
manual page and the web page for KNL as penance for not having
time to get back to you sooner! :-)

https://github.com/SchedMD/slurm/pull/166

All the best,
Chris
Comment 12 Christopher Samuel 2018-02-21 15:38:15 MST
Created attachment 6213
knl_man_page.patch

On 22/02/18 09:34, Christopher Samuel wrote:

> I've just done a pull request against the 17.11 branch for the
> manual page and the web page for KNL as penance for not having
> time to get back to you sooner! :-)

Just saw from Danny that you don't accept PRs on GitHub. :-(

Here are the patches as attachments instead.

cheers,
Chris
Comment 13 Christopher Samuel 2018-02-21 15:38:16 MST
Created attachment 6214
knl_web_page.patch
Comment 14 Christopher Samuel 2018-02-21 15:42:05 MST
Just reopening to add the patches I emailed through.
Comment 15 Isaac Hartung 2018-02-21 15:44:24 MST
Ok, we'll take a look at these and let you know.  Thanks Chris!
Comment 17 Isaac Hartung 2018-02-21 15:49:31 MST
Hi Chris,

Your patch has been committed: 356f1e40b0a130c6a5baebea37d8aa9c4890c760.

Thanks again!

Regards,

Isaac
Comment 18 Christopher Samuel 2018-02-21 15:56:09 MST
On 22/02/18 09:49, bugs@schedmd.com wrote:

> Your patch has been committed: 356f1e40b0a130c6a5baebea37d8aa9c4890c760.

Thanks Isaac, very kind!

All the best,
Chris