Ticket 12288 - RFE: support rebootless node_features/helpers features changes
Summary: RFE: support rebootless node_features/helpers features changes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Marcin Stolarek
Reported: 2021-08-16 16:05 MDT by Felix Abecassis
Modified: 2024-04-18 00:47 MDT

See Also:
Site: NVIDIA (PSLA)
Version Fixed: 24.05.0rc1
Target Release: 24.05
DevPrio: 1 - Paid


Description Felix Abecassis 2021-08-16 16:05:34 MDT
As mentioned in item g) from the original design https://bugs.schedmd.com/show_bug.cgi?id=9567

Some feature changes do not require a reboot, and a reboot can take a while if there is a lot of firmware / drivers to initialize. Changing a GPU setting (like MIG) often requires just a driver unload/reload or a GPU reset.
Ideally, node_features/helpers should allow *some* features to be marked as not requiring a reboot. The usual node drain would still happen, to make sure no other jobs are running on the node when the helper script makes the change (unloading a driver is disruptive), but there should be no reboot afterwards. Note: a fake reboot script that just restarts slurmd does not work, as slurmctld expects the node to report a node boot (e.g. like "slurmd -b" does for debug purposes).

Obviously, other features, like those impacting the BIOS, will always require a reboot.
Comment 9 Marcin Stolarek 2024-03-20 01:30:54 MDT
Felix,

The implementation of the Flags=rebootless setting in helpers.conf was merged into our master branch (7914bb0476). You can find the docs in ./doc/man/man5/helpers.conf.5 on master.

Simply, you need to define a feature like:
>#cat helpers.conf
>NodeName=node1 Feature=a1,a2 Helper=/home/cinek/slurm-confs/Bug12288/h.sh Flags=RebootLess 
>                                                                          ^^^^^^^^^^^^^^^^

which prevents the actual node reboot on feature change.
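
For illustration only, a helper such as h.sh might look like the sketch below. It assumes the node_features/helpers convention that the helper prints the active feature when called with no arguments and applies the feature passed as its single argument. The feature names mig=on/mig=off, the function name mig_helper and the STATE_FILE variable are hypothetical, and a plain state file stands in for the real GPU reset or driver reload so that the sketch runs without a GPU.

```shell
#!/bin/sh
# Hypothetical helper for two mutually exclusive features: mig=on and mig=off.
# ASSUMPTION: a state file stands in for the real GPU query/reset commands,
# so this sketch can run on a machine without a GPU.
STATE_FILE="${STATE_FILE:-/tmp/mig_helper_state}"

mig_helper() {
    if [ "$#" -eq 0 ]; then
        # Report mode: print the currently active feature, one per line.
        if [ -s "$STATE_FILE" ]; then
            cat "$STATE_FILE"
        else
            echo "mig=off"
        fi
        return 0
    fi

    case "$1" in
        mig=on|mig=off)
            # Change mode: a real helper would reset the GPU or reload the
            # driver here; this sketch only records the requested feature.
            printf '%s\n' "$1" > "$STATE_FILE"
            ;;
        *)
            echo "unknown feature: $1" >&2
            return 1
            ;;
    esac
}

mig_helper "$@"
```

With Flags=RebootLess on the matching helpers.conf line, the node would still be drained before such a helper runs, but no reboot would follow the change.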

Could you please give it a try and share your feedback?

cheers,
Marcin
Comment 11 Felix Abecassis 2024-03-25 15:41:54 MDT
Hello,

Functionally, it seems to work fine, thanks!

But there is a minor cosmetic issue: I still see the following slurmd log:
slurmd: Node reboot request with features mig=off being processed

But there is no reboot done.
Comment 15 Marcin Stolarek 2024-03-27 02:59:06 MDT
Hi Felix,

We've pushed updates to the log messages around this to make them less confusing, commit: 624fea12c8[1].

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/624fea12c8c590705462c4247f6ff97457933936
Comment 16 Felix Abecassis 2024-03-27 11:50:40 MDT
I found another cosmetic issue while testing the feature. It's not really related to your changes, but I thought I would mention it here; let me know if you want a new bug instead.

I see the following in the kernel log on starting slurmd:
```
Mar 27 10:45:05 ioctl kernel: process 'slurmd' launched '/etc/slurm/node_features_helpers/nvidia_mig_off' with NULL argv: empty string added
```

It's caused by this Linux kernel patch: https://github.com/torvalds/linux/commit/dcd46d897adb70d63e025f175a00a89797d31a43.
This seems to be coming from calling run_command without "script_argv" set in _feature_get_state().
Comment 17 Marcin Stolarek 2024-03-28 02:47:13 MDT
Felix - I've split this off into Bug 19449 and added you as CC there.
Comment 18 Marcin Stolarek 2024-04-18 00:47:08 MDT
Felix,

I'll go ahead and mark the ticket fixed. The remaining issue is tracked separately, since it relates not only to the new development but also to the way Slurm executes external applications/scripts.

Should you have any questions, please reopen.

cheers,
Marcin