Ticket 13545

Summary: node_features helper binaries called on periodic node registration
Product: Slurm Reporter: Felix Abecassis <fabecassis>
Component: slurmd Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian, dwightman, jbernauer, jbooth, lyeager, marshall
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12288
https://bugs.schedmd.com/show_bug.cgi?id=13568
Site: NVIDIA (PSLA) Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 22.05.0pre1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Felix Abecassis 2022-03-01 16:59:33 MST
On a cluster running the initial node_features/helpers plugin (before it was upstreamed), we noticed that the helper binaries are called regularly, even in the middle of user jobs. The behavior is also present with the upstream node_features/helpers plugin in 21.08.6.

This is potentially a performance problem, as the helper binaries might not be lightweight: they might query BIOS settings, call nvidia-smi, etc. Running these scripts concurrently with user jobs could cause undesired jitter for those applications.


To reproduce:
- Decrease MAX_REG_FREQUENCY (https://github.com/SchedMD/slurm/blob/slurm-21-08-6-1/src/slurmctld/ping_nodes.c#L55) to 4
- Set "SlurmdTimeout=60" to get a ping interval of 20s (https://github.com/SchedMD/slurm/blob/slurm-21-08-6-1/src/slurmctld/controller.c#L1958)
- Set "NodeFeaturesPlugins=node_features/helpers" and enable any number of helper binaries.
- Launch slurmd in verbose mode
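
For reference, the reproduction steps above correspond to a configuration fragment like the following. This is a sketch: the helper path and the helpers.conf line are illustrative, and note that MAX_REG_FREQUENCY is a compile-time constant in src/slurmctld/ping_nodes.c, so it must be lowered in the source and slurmctld rebuilt.

```
# slurm.conf fragment (illustrative)
SlurmdTimeout=60                           # yields a ~20s ping interval
NodeFeaturesPlugins=node_features/helpers

# helpers.conf fragment (path is hypothetical)
Feature=nps Helper=/usr/local/bin/nps_helper
```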

The slurmd log will show that features are queried repeatedly, meaning that the helper binaries are being called regularly:
```
slurmd: node_features/helpers: node_features_p_node_state: original: avail=(null) current=(null)
slurmd: node_features/helpers: node_features_p_node_state: new: avail=nps=4,nps=2,nps=1,mig=off,mig=on current=nps=4,mig=on
slurmd: node_features/helpers: node_features_p_node_state: original: avail=(null) current=(null)
slurmd: node_features/helpers: node_features_p_node_state: new: avail=nps=4,nps=2,nps=1,mig=off,mig=on current=nps=4,mig=on
```

To avoid/limit the performance impact on user applications, I can see a few possible options:
- Make MAX_REG_FREQUENCY configurable so that the value can be increased (doesn't really eliminate the problem).
- Do not send a registration request if a node is not idle.
- Have the node_features internal code request the current state only once, when slurmd is started. Since all features currently require a reboot (https://bugs.schedmd.com/show_bug.cgi?id=12288 is still pending), the state of *all* features can be safely cached in memory.
- Have the node_features/helpers code call the binaries only once, for the same reasons as the previous point.


cc Marshall as he looked at https://bugs.schedmd.com/show_bug.cgi?id=12993 recently
Comment 3 Carlos Tripiana Montes 2022-03-03 02:17:18 MST
Dear Felix,

I'm pretty sure the helpers plugin has always worked this way (I need to check the others). slurmd has always reported its dynamic features to slurmctld via the registration RPC.

I believe increasing SlurmdTimeout will increase the ping interval, which in turn increases the time between registration messages, so that's a potential workaround.

You could also use CoreSpecCount/CpuSpecList to make sure that slurmd/slurmstepd never run on the same CPUs as the jobs. You would have to use the task/cgroup plugin, but I think you already do. Any scripts run by slurmd/slurmstepd (prolog/epilog, helpers, etc.) should be constrained by this cgroup as well, but I need to double-check that.

Regarding the issue itself, I have some ideas, but I will need some time to weigh the pros and cons and discuss this internally. This also looks somewhat like an enhancement, so I need to double-check that too.

I'll keep you posted,
Carlos.
Comment 5 Felix Abecassis 2022-03-03 09:28:05 MST
> I'm pretty sure the helpers plugin has always worked this way (I need to check others). slurmd has always told slurmctld its dynamic features via registration RPC.

True, but node_features/helpers was added in 21.08, and before that the only existing plugin was for KNL. So perhaps we can call it an oversight on my part when designing the new node_features/helpers plugin in https://bugs.schedmd.com/show_bug.cgi?id=9567, but I would qualify it as a minor performance bug.

> I believe increasing SlurmdTimeout will increase the ping interval which in turn increases the amount of time in between registration messages

Like increasing MAX_REG_FREQUENCY, this doesn't really eliminate the problem, unfortunately.

> You could also use CoreSpecCount/CpuSpecList to make sure that slurmd/slurmstepd never run on the same CPUs as the jobs. 

This is not enough to prevent possible performance interference to the user applications, so we can't use this approach.
Comment 8 Carlos Tripiana Montes 2022-03-04 01:05:33 MST
Hi Felix,

I do see your point here, just give me some time to evaluate the best way to tackle the issue.

It might also collide with "scontrol reconfig" or "scontrol update [...]" for avail/active features. So, if we decide not to run the script on every registration RPC, we at least need to run it when the features change.

Anyway, I think the idea of not executing the script every time is the right direction; what remains is to work out how to accomplish it. My view is that Bug 12288 will be implemented at some point, so it's best to address the issue here in a general way.

I think node reboots are well controlled now. So now we want to see how we can control when to run the helper: maybe by comparing the current and new active/avail features when a potential modification is triggered, and running the helper *only* then?

Everything I'm saying is just a conceptual explanation, and I need to dig into the code to see if this can be done. But what do you think?

Thanks,
Carlos.
Comment 9 Carlos Tripiana Montes 2022-03-04 04:05:52 MST
To be explicit: looking at the source, "node_features_p_node_state" could be modified so that, if "avail_modes" is the same as in the last call, it does not trigger "list_for_each(helper_features, _foreach_helper_get_modes, &args)".

The way I see to do this, in a thread-safe manner, is to add:

static char **prev_avail_modes = NULL;

as a static local variable in "node_features_p_node_state", and store the "avail_modes" there at the end of the call so that the next call can check the previous values.

I still need to check if this approach works, but it seems like a potential solution.
Comment 10 Carlos Tripiana Montes 2022-03-04 06:34:27 MST
Ahh (I think) I see the point now.

1. It always uses _feature_get_state to directly query the hardware for the real active features. This way Slurm does not cache anything, and if a feature is changed outside Slurm, Slurm will still become aware of it.

2. But if we create some sort of cache for features and avoid using _feature_get_state, then we can end up with problems when a feature is changed outside Slurm, as in point 1.

3. I think one possibility is to cache active features in _feature_set_state, disregard the first approach from Comment 9, and use _feature_get_state only when slurmd starts or when scontrol reconfig/update is issued. We should then document appropriately that features can no longer be modified outside Slurm.

It still remains to check how to detect scontrol reconfig/update; slurmd startup is the simple case, because the cache variables are still NULL at that point.
Comment 12 Felix Abecassis 2022-03-04 10:29:01 MST
Yes, I suppose that one way of supporting "rebootless" node features changes today would be to modify the state of the node outside of Slurm, and then issue a "scontrol update" (or reconfigure). I think it's fine if all state changes must be under Slurm control instead, as long as we still leave the door open for rebootless node features changes to be implemented in a future version of Slurm.


Unrelated to node_features but related to the performance issue here, do you know if GRES autodetection will also be triggered again on periodic node re-registration? Having many calls to NVML for finding available GPUs could also impact user jobs.
Comment 13 Carlos Tripiana Montes 2022-03-06 09:26:09 MST
> do you know if GRES autodetection will also be triggered again on periodic node re-registration?

As far as I understand, no: only when the config is generated, in gpu.c or gres.c. When used from gres_gpu.c, it only queries node_flags to find which bits are set.
Comment 19 Carlos Tripiana Montes 2022-03-14 05:04:43 MDT
Felix,

We're actively working on this improvement, and it's now under review.

I'll let you know once it's released.

Regards,
Carlos.
Comment 21 Felix Abecassis 2022-03-14 15:49:46 MDT
Got it, thanks for the update!
Comment 31 Carlos Tripiana Montes 2022-04-18 04:34:23 MDT
Felix,

Still working on this improvement. It's not stalled.

Regards,
Carlos.
Comment 33 Carlos Tripiana Montes 2022-04-20 10:23:03 MDT
Felix,

Happy to say the fixes have been pushed to master:

*   a298b1d46f (HEAD -> master, origin/master, origin/HEAD) Merge branch 'bug13545'
|\  
| * afa04f5608 (bug13545) Adding NEWS entry for the last 2 commits
| * 1140899487 slurmd - Refresh features cache if scontrol reconfig
| * 0da310d444 slurmd - Cache features at startup
|/

Going to mark the bug as resolved.

Cheers,
Carlos.
Comment 34 Felix Abecassis 2022-04-22 11:06:02 MDT
I confirm that it seems fixed on master, thank you!