Ticket 9567

Summary: Contribution: generic node_features plugin
Product: Slurm Reporter: Felix Abecassis <fabecassis>
Component: OtherAssignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: C - Contributions    
Priority: --- CC: ben, dwightman, jbernauer, jess, lyeager, sts
Version: 20.11.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10707
https://bugs.schedmd.com/show_bug.cgi?id=11182
https://bugs.schedmd.com/show_bug.cgi?id=12068
https://bugs.schedmd.com/show_bug.cgi?id=12251
Site: NVIDIA (PSLA) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 21.08.0rc1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: node_features/helpers v1
node_features/helpers v2
node_features/helpers v3
Design Document
Patch
ExecTime Patch

Description Felix Abecassis 2020-08-11 15:16:15 MDT
Created attachment 15401 [details]
node_features/helpers v1

SchedMD team, 

As we discussed recently, here is a starting point for a new type of node_features plugin (https://slurm.schedmd.com/node_features_plugins.html) that defers the main logic to external binaries called "helpers" (as a nod to the usermode-helper API in the Linux kernel). This is a long message, so let me know if you prefer to use another approach (like a Google doc) to discuss the design.


The existing node_features plugins are for KNL, and adding new plugins requires writing C code and duplicating logic already present in the KNL plugins. As those plugins are in-tree, it also requires recompiling and reinstalling Slurm.
This process has a huge overhead, and on our cluster we would like to experiment with new dynamic node features on a more frequent basis without having to reinstall Slurm.

One example is being able to reboot DGX A100 nodes to a different NUMA Per Socket (NPS) mode, as described in https://bugs.schedmd.com/show_bug.cgi?id=9431#c0
Another example would be to modify Linux kernel boot parameters, being able to reboot to a different kernel, or reboot to a different version of the NVIDIA device driver.

Obviously, we don't want to write ad-hoc plugins for each case. The requirements behind each are pretty much identical:
1. Each feature must have only one active state.
2. We need a way to get the current feature state on a node.
3. We need a way to change the feature before the reboot is triggered.
4. We need Slurm to keep track of the node's state regarding each feature.
5. We need Slurm to favor scheduling to nodes with the right state already.

To solve 1., a simple way to handle mutually exclusive options (like NPS4 and NPS1), is to have the features expressed in a key-value form:
$ sinfo -a -o "%16N | %16b | %f"
NODELIST         | ACTIVE_FEATURES  | AVAIL_FEATURES
node-[1-2]       | nps=4            | nps=1,nps=2,nps=4
This way, you can't have "nps=4" and "nps=1" at the same time. This would have been more difficult to express without a key-value format (e.g. "nps4" and "nps1").


The solution requirements 2. and 3. is to rely on external binaries to handle the feature-specific detection/change. For each feature, a line must be added in a new configuration file node_features.conf, pointing to the helper binary/script:
Feature=nps Helper=/etc/slurm/node_features_helpers/rome
The helper can be a bash script, a Python script, a C program, anything. As long as it follows the simple API below.
Called with no arguments, the helper binary must list all the key-value pairs it support (possibly for multiple features):
$ /etc/slurm/node_features_helpers/rome 
nps=1
nps=2
nps=4
tdp=default
tdp=240

Called with a key as an argument, the helper must return the state of the system (it can be done by querying sysfs or procfs, by maintaining a state file, querying a kernel module...):
$ /etc/slurm/node_features_helpers/rome nps
nps=4

Last but not least, to change an active feature, the helper must be called with the desired key-value pair:
$ /etc/slurm-llnl/node_features_helpers/amd_rome nps=1
Then slurmd will take care of the reboot.

And that's all, a key-value format handled by custom external binaries is the gist of this proposal. To match the KNL plugin there are also configuration options to tweak boot time and node reboot weight, but it's straightforward.
The user-facing API is no different than using static constraints declared at the node level:
$ srun --constraint nps=1 ...


Of course, there are many design choices that were made to reach the current state, so I'm going to list them below, to prefetch the questions you might have. 
a) An important one is disabling most of the complicated DSL for constraints (see node_features_p_job_valid). We didn't see the need for it, and most of it either doesn't work or it's unclear what the plugin should be doing. For instance when using --constraint="[nps=4|nps=1]", even if a node satisfies nps=4 or nps=1, the node_features plugin will be asked to reboot and be passed "[nps=4|nps=1]". That means each plugin needs to reimplement the DSL parsing, and make a decision on what exactly to do. I think the DSL parsing should be done by Slurm and only individual features should be handled by the plugin.

b) We can't handle mutually exclusive keys right now. It's definitely possible with the current approach since a helper can handle multiple keys, it could decide to reject invalid combinations if we were to pass all key-values at once:
$ ./helper foo=1 bar=notfoo

c) node_features_p_user_update always returns "true". We can do like KNL and add a configuration option node_features_p_user_update. But it's not done yet since it would duplicate the KNL code.

d) We can't share our specific helper binary for NPS. I can either share my "fake" helper binaries that I used for development, or do a simple helper binary for a generic feature requiring a reboot (e.g. changing a Linux kernel boot parameter).


And then there are other improvements that would be useful to improve node_features plugins in general:
e) If a node_features plugin fails to change a feature, slurmd still attempts to reboot the node. It should probably drain the node.

f) After a node reboots, Slurm doesn't check if the state is the expected one. This can happen if something weird occurred during boot. We solve that today with a Slurm prolog that checks what the job requested VS the node state after boot. It should automatically drain the node too.

g) We have some features that still take a while to change, but don't require a full reboot. Using the state tracking and scheduling of Slurm could help optimize this use case too.
We could have a new API call that returns whether a reboot is needed for a given feature:
bool node_features_p_reboot_needed(char *feature) 


Let me know what you think.
Comment 1 Felix Abecassis 2020-08-11 15:22:15 MDT
BTW, we are at ~550 lines of code for this generic plugin. It could be less without having to reimplement some of the boilerplate functions.

KNL plugins are at 1700 / 2600 lines of code. Of course they don't behave exactly the same (for instance, it seems the KNL plugins modify GRES, but I guess a helper plugin could also do it through the Slurm CLI).
Comment 2 Tim Wickberg 2020-08-11 15:55:37 MDT
We prefer to stick with Bugzilla so we can come back to this readily at a later point.

I'll stick to one point in my first comment here, I'm still mulling over some other details, and I'll need to discuss some pieces further internally.

> Obviously, we don't want to write ad-hoc plugins for each case. The
> requirements behind each are pretty much identical:
> 1. Each feature must have only one active state.
> 2. We need a way to get the current feature state on a node.
> 3. We need a way to change the feature before the reboot is triggered.
> 4. We need Slurm to keep track of the node's state regarding each feature.
> 5. We need Slurm to favor scheduling to nodes with the right state already.
> 
> To solve 1., a simple way to handle mutually exclusive options (like NPS4
> and NPS1), is to have the features expressed in a key-value form:
> $ sinfo -a -o "%16N | %16b | %f"
> NODELIST         | ACTIVE_FEATURES  | AVAIL_FEATURES
> node-[1-2]       | nps=4            | nps=1,nps=2,nps=4
> This way, you can't have "nps=4" and "nps=1" at the same time. This would
> have been more difficult to express without a key-value format (e.g. "nps4"
> and "nps1").

I'm not sold on the key=value syntax just yet. Especially as you note here:

> b) We can't handle mutually exclusive keys right now. It's definitely
> possible with the current approach since a helper can handle multiple keys,
> it could decide to reject invalid combinations if we were to pass all
> key-values at once:
> $ ./helper foo=1 bar=notfoo

Which means we'd still likely need some extra configuration syntax to cover that situation anyways.

So I think I'd rather consider these features as "nps1", "nps2", "nps4" instead.

I don't have an immediate fix for this, but I'm thinking some configuration declaration that a given collection of features are mutually exclusive is called for. Dunno what exactly that syntax would look like at the moment though.
Comment 3 Felix Abecassis 2020-08-11 16:11:05 MDT
Thanks! As a single data point, we don't have a use case for mutually exclusive keys today.

The key-value approach is to get a tiny bit of structure and make it easier for users to parse and understand which values available for a feature. If we had "nps1", "nps2" and "nps4", it would not be immediately obvious (nor enforced) that they are related, and whether you can have multiple of those active at the same time.

We could also envision adding types to values, to limit the impact of misbehaving helpers. For instance NPS could be "int", NVIDIA MIG could be "bool", etc. It's not possible with strings everywhere.

It's also possible to merge keys and express combination in the values, for instance: bar=notfoo+2 or bar=notfoo;2 (depending on what Slurm would accept).
Comment 5 Felix Abecassis 2020-12-10 16:11:30 MST
Going back to this after many months, I would like to clarify something regarding point a) mentioned in the initial message: I'm not suggesting to remove the constraint DSL in Slurm, that would be a radical change that would break many valid use cases today (--constraint is often used with static node features declared in slurm.conf).

The goal of this RFE is to start a discussion regarding the current node_features API (which is just for KNL right now). It is not mergeable in the current state since we can't deprecate the constraint DSL for an optional plugin. My suggestion is to modify the node_features API to only pass features one at a time to the plugin, having plugins reimplement the DSL parsing is error prone. And not everything is implemented on the Slurm side anyway, so you would have to match the limitations of the Slurm parser / handler. Or the API should pass the full expression to the plugin in a non-ambiguous form like an abstract syntax tree.

And in some situations the plugin has no way of knowing what to do, for instance with --constraint="[nps=4|nps=1]", node_features_p_changeable_feature  on the slurmctld side is called with inputs "nps=4" and then "nps=1" (which is simple to deal with); but node_features_p_node_set on the slurmd side is called with "[nps=4|nps=1]", so the plugin doesn't know what to do here. Passing an AST wouldn't solve this problem.

When you have time to look at this RFE again, please let me know if/how you wish to proceed. I file more bug reports with the node_features API, we can decide whether the key-value format is a good thing for this plugin, we can discuss whether we want to express that some features are mutually exclusive, etc.
Comment 6 Felix Abecassis 2020-12-10 17:57:42 MST
Created attachment 17109 [details]
node_features/helpers v2
Comment 7 Felix Abecassis 2020-12-10 18:03:00 MST
Submitted v2 of the patch that uses a stopgap measure to restore the DSL when possible.

In node_features_p_job_valid, it used to consider the constraint to be invalid whenever it contained one of those operators: []()|*
Now, it also checks whether the expression contains one of the key/value handled by the plugin, but since it doesn't parse the expression it just uses strstr(3), e.g. looking for "nps=" in the string. If it does, then the constraint is considered invalid.

As a result, node_features_p_job_xlate was also modified. If one of the operators mentioned above was used, we return an empty string. This prevents Slurm from calling node_features_p_node_set with this constraint and attempting to reboot the node.
Comment 9 Douglas Wightman 2021-04-29 16:52:13 MDT
Created attachment 19223 [details]
node_features/helpers v3

Attaching an updated patch for the node features helpers plugin.

This includes everything discussed and agreed upon in the design document, including: removing the reliance on <ATTR>=<VALUE> syntax, adding the optional MutuallyExclusive setting, formatting, etc. Looking forward to feedback!
Comment 10 Douglas Wightman 2021-04-29 16:53:16 MDT
Created attachment 19224 [details]
Design Document

Attaching design document for reference.
Comment 12 Tim Wickberg 2021-05-18 17:10:09 MDT
Hi Doug -

I've been working on reviewing the patch this afternoon, and this looks pretty good so far. Functionally things appear to be in pretty good shape, and the bulk of my first pass is focused on style quirks instead.

I've pushed a new branch to our github to aid in review - bug9567 - which includes a number of trivial style changes, as well as a handful of FIXME comments that I'd like you to take a look at.

If you don't have any problems with what's there those style commits will all be squashed into your initial commit on the branch; if you notice any issue or have questions on them please ask.

As for the FIXMEs, the main notes I have and would appreciate some following patches for are:
- Please avoid using ListIterator. It's quite slow compared to list_for_each / list_find_first, and we've been trying to avoid it in newer code.
- Please look at using run_command() from our common library rather than your local reimplementation.

Also, I did rename the config file to "helpers.conf" to match the plugin name. If you have some other preference please let me know.

- Tim
Comment 13 Douglas Wightman 2021-05-18 17:20:26 MDT
Thanks Tim, I'll start work on addressing those FIXMEs tomorrow.
Comment 14 Douglas Wightman 2021-05-21 08:49:05 MDT
Created attachment 19604 [details]
Patch

Tim,

The attached patch addresses review feedback and FIXME items:

- Use standard run_command()
- Remove use of ListIterator
- Strict error handling of conf parsing failures (previously ignored duplicate entries)

Let me know if you have any issues or concerns with this latest update.
Comment 15 Tim Wickberg 2021-07-19 19:04:29 MDT
Hey Doug -

This is on master now, and will be included in 21.08 when released next month. I did squash all the fixup work and some additional style tweaks into this single commit, but functionally this is unchanged from the original submission.

I'm going to move ahead and mark this resolved. I'll split off test development (provided by SchedMD per the contract) into a new issue to track that, but I don't expect any issues there.

- Tim

commit f8b2c358b2baffceddfb59c97c0f56bd647427a6
Author:     Douglas Wightman <dwightman@nvidia.com>
AuthorDate: Thu Apr 29 09:05:38 2021 -0600

    Add node_features/helpers plugin.
    
    A new node-features plugin for managing dynamic features through the use
    of helper binaries. These binaries are configured via the helpers.conf
    file, along with other plugin-specific options such as MutuallyExclusive.
    
    Bug 9567.
    
    Co-authored-by: Felix Abecassis <fabecassis@nvidia.com>
Comment 16 Douglas Wightman 2021-08-20 14:15:42 MDT
Created attachment 20965 [details]
ExecTime Patch

Hi Tim, I'm attaching a small patch with 2 changes. (1) Allow admins to configure ExecTime, a time limit for the helpers, and (2) a simple README that explains how to configure and use the plugin. If there are no concerns with these changes please merge them at your earliest convenience, perhaps 21.08.1? Thanks!
Comment 17 Tim Wickberg 2021-08-20 23:49:34 MDT
Thanks Doug.

I've added the configuration option with the following commit, it's in ahead of 21.08.0.

READMEs aren't how we like to document things at this point in time. I'll ask Ben to take a look into getting some proper documentation put together based off that patch though.

- Tim

commit 0ee9df12503baeedde255256de14785bd4d1cd63
Author:     Douglas Wightman <dwightman@nvidia.com>
AuthorDate: Fri Aug 20 23:45:28 2021 -0600

    node_features/helpers - add ExecTime option.
    
    Bug 9567.