Ticket 6385

Summary: acct_gather_filesystem not compatible with Lustre 2.12
Product: Slurm Reporter: Stephane Thiell <sthiell>
Component: slurmstepdAssignee: Albert Gil <albert.gil>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex, kilian
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: Stanford Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 18.08.6 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Proposed patch to add support for Lustre 2.12
Revised patch for Lustre 2.12.

Description Stephane Thiell 2019-01-18 14:59:44 MST
Hi,

We're upgrading Sherlock to Lustre 2.12 and we hit the following issue:

slurmstepd: error: _read_lustre_counters: Cannot open /proc/fs/lustre/llite No such file or directory
slurmstepd: error: acct_gather_filesystem_p_get_data: Cannot read lustre counters
slurmstepd: error: _read_lustre_counters: Cannot open /proc/fs/lustre/llite No such file or directory
slurmstepd: error: acct_gather_filesystem_p_get_data: Cannot read lustre counters

It looks like in 2.12+ /proc/fs/lustre/llite was moved to /sys/fs/lustre/llite


2.12 is the latest LTS release, it would be nice if Slurm could support it.

Thanks!

Stephane
Comment 10 Kilian Cavalotti 2019-01-23 12:55:43 MST
Created attachment 8988 [details]
Proposed patch to add support for Lustre 2.12

Here's a tentative patch to add support for Lustre 2.12 (Lustre client stats moved to /sys/kernel/debug/lustre/llite).

Tested on Lustre 2.10 and Lustre 2.12.
Comment 11 Albert Gil 2019-01-24 08:19:55 MST
Thanks for the patch Killian,

We are working/reviewing in a similar patch and I will attach it here soon.
Our patch will probably also create a new parameter on acct_gather.conf, just in case.

Albert
Comment 16 Tim Wickberg 2019-01-28 20:09:42 MST
Created attachment 9035 [details]
Revised patch for Lustre 2.12.

Kilian -

I split one part off in to commit 7bd7847ff9cd and pushed that already.

Are you alright with the attached as the final patch post review? I restructured some of the code, and avoided making some of the variable name changes you did (we try to limit the changes that land on the stable branches as much as possible), but your original intent is still there, and I believe this should work correctly across both Lustre client versions.

This also fixes the one performance concern I had on reviewing your patch by storing the directory lookup result - what you'd proposed would have run opendir() + closedir() an extra time on all stats lookups which I'd rather avoid.

- Tim
Comment 17 Kilian Cavalotti 2019-01-29 10:06:05 MST
Hi Tim. 

Thanks so much for reviewing and improving that patch.

(In reply to Tim Wickberg from comment #16)
> I split one part off in to commit 7bd7847ff9cd and pushed that already.

Great!

> Are you alright with the attached as the final patch post review? I
> restructured some of the code, and avoided making some of the variable name
> changes you did (we try to limit the changes that land on the stable
> branches as much as possible), but your original intent is still there, and
> I believe this should work correctly across both Lustre client versions.
>
> This also fixes the one performance concern I had on reviewing your patch by
> storing the directory lookup result - what you'd proposed would have run
> opendir() + closedir() an extra time on all stats lookups which I'd rather
> avoid.

Absolutely, that looks much better, actually. 
So, yes, I'm good with your proposed change, thanks again for the review!

Cheers,
-- 
Kilian
Comment 20 Albert Gil 2019-02-11 03:49:39 MST
Stephane,
I'm closing this bug as fixed because Killian's patch has been committed by Tim in branch slurm-18.08 and will be released in 18.08.6:

commit f2feb6e2fd78a05f94cddea73519f05782944d69
Author: Kilian Cavalotti <kilian@stanford.edu>
Date:   Mon Jan 28 20:02:10 2019 -0700

    Support Lustre 2.12 change to stats location.

    Lustre client stats moved to /sys/kernel/debug/lustre/llite with 2.12.

    Bug 6385.

    Co-authored-by: Tim Wickberg <tim@schedmd.com>

Regards,
Albert
Comment 21 Kilian Cavalotti 2019-02-11 08:42:46 MST
(In reply to Albert Gil from comment #20)
> Stephane,
> I'm closing this bug as fixed because Killian's patch has been committed by
> Tim in branch slurm-18.08 and will be released in 18.08.6:
> 
> commit f2feb6e2fd78a05f94cddea73519f05782944d69
> Author: Kilian Cavalotti <kilian@stanford.edu>
> Date:   Mon Jan 28 20:02:10 2019 -0700
> 
>     Support Lustre 2.12 change to stats location.
> 
>     Lustre client stats moved to /sys/kernel/debug/lustre/llite with 2.12.
> 
>     Bug 6385.
> 
>     Co-authored-by: Tim Wickberg <tim@schedmd.com>
> 
> Regards,
> Albert

Thank you!

Cheers,
-- 
Kilian