Ticket 7566 - Make Infiniband and Lustre Accounting Node Specific
Summary: Make Infiniband and Lustre Accounting Node Specific
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.1
Hardware: Linux Linux
: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-13 09:34 MDT by Paul Edmon
Modified: 2019-08-28 16:17 MDT

See Also:
Site: Harvard University


Description Paul Edmon 2019-08-13 09:34:28 MDT
Currently both Infiniband and Lustre usage tracking are global switches that you turn on or off.  When they are on, they expect every node to have IB and Lustre, which isn't always the case (especially in our environment).  It would be good to have Slurm either autodetect whether IB and Lustre are present or have a switch in acct_gather.conf to inform the scheduler whether IB and/or Lustre is present.  Currently, if they aren't present, slurmd still tries to poll for them, which generates errors and warnings that pollute user logs and slow down their jobs.
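
To illustrate the autodetection idea, here is a minimal sketch of a per-node presence check. It is not how Slurm or its acct_gather plugins work; it only assumes that InfiniBand devices appear as entries under /sys/class/infiniband and that mounted Lustre filesystems appear with type "lustre" in /proc/mounts. The function names has_infiniband and has_lustre are hypothetical.

```python
import os


def has_infiniband(sysfs_dir="/sys/class/infiniband"):
    """Return True if at least one device entry exists under sysfs_dir."""
    try:
        return len(os.listdir(sysfs_dir)) > 0
    except OSError:
        # Directory missing or unreadable: treat as "no IB on this node".
        return False


def has_lustre(mounts_file="/proc/mounts"):
    """Return True if any mount listed in mounts_file has type 'lustre'."""
    try:
        with open(mounts_file) as fh:
            for line in fh:
                fields = line.split()
                # Columns: device, mount point, filesystem type, options, ...
                if len(fields) >= 3 and fields[2] == "lustre":
                    return True
    except OSError:
        # File missing or unreadable: treat as "no Lustre on this node".
        pass
    return False


if __name__ == "__main__":
    print("infiniband:", has_infiniband())
    print("lustre:", has_lustre())
```

A check along these lines could be run on each node to decide whether polling makes sense there, instead of assuming every node has both.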
Comment 2 Jason Booth 2019-08-13 11:40:31 MDT
Hi Paul. Thanks for logging this issue with us. There is some enhancement work involved to make these configurable on a per-node basis. Is this something your site is interested in sponsoring development for?

As a workaround, you could try setting the following:

JobAcctGatherFrequency=[filesystem|network]=0

According to the documentation:

"An interval of 0 disables sampling of the specified type.  If the task sampling interval is 0, accounting information is collected only at job termination (reducing Slurm interference with the job)."
Comment 3 Paul Edmon 2019-08-13 11:44:54 MDT
Right now we just removed those completely from our config so they 
aren't even polling anymore.

As for sponsoring, as it stands not right now.  I know we have a number 
of pending feature requests, some of which are more broadly applicable 
to the community and have significant interest.  Right now we are just 
dropping these in here so that they may be done for the sake of general 
improvement as you or others have time. Given the number of outstanding 
requests we have (about 20+ at this point) we may look at sponsoring 
some work in the future, but I will need to talk it over with my 
management first.  So as it stands do not expect anything from us other 
than just providing these as suggestions for future improvement that 
would greatly aid us and the community.

-Paul Edmon-

Comment 4 Jason Booth 2019-08-13 11:56:46 MDT
Thanks, Paul - converting this over to an enhancement.