Currently both InfiniBand and Lustre usage tracking are global switches that you turn on or off. When they are on, they expect every node to have IB and Lustre, which isn't always the case (especially in our environment). It would be good to have Slurm either autodetect whether IB and Lustre are present, or have a switch in acct_gather.conf to inform the scheduler whether IB and/or Lustre is present. Currently, if they aren't present, slurmd still tries to poll for them, which generates errors and warnings that pollute user logs and slow down jobs.
Hi Paul. Thanks for logging this issue with us. There is some enhancement work involved to make these configurable on a per-node basis. Is this something your site is interested in sponsoring development for? As a workaround, you could try setting JobAcctGatherFrequency=[filesystem|network]=0, which according to the documentation: "An interval of 0 disables sampling of the specified type. If the task sampling interval is 0, accounting information is collected only at job termination (reducing Slurm interference with the job)."
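For reference, the workaround above might look like the following slurm.conf fragment. This is a sketch only: the task interval of 30 seconds is an illustrative value, not taken from this report, and the setting applies cluster-wide rather than per node, so it also turns off sampling on nodes that do have IB and Lustre.

```
# slurm.conf (fragment, illustrative)
# Disable periodic filesystem (Lustre) and network (InfiniBand) sampling.
# task=30 is an assumed example value for the task sampling interval.
JobAcctGatherFrequency=task=30,filesystem=0,network=0
```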
Right now we have just removed those completely from our config, so they aren't even polling anymore. As for sponsoring: as it stands, not right now. I know we have a number of pending feature requests, some of which are more broadly applicable to the community and have significant interest. For now we are just dropping these in here so that they may be done for the sake of general improvement as you or others have time. Given the number of outstanding requests we have (20+ at this point), we may look at sponsoring some work in the future, but I will need to talk it over with my management first. So as it stands, do not expect anything from us other than these suggestions for future improvement, which would greatly aid us and the community. -Paul Edmon-
Thanks, Paul - converting this over to an enhancement.