| Summary: | RFE: support rebootless node_features/helpers features changes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
| Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | brian, ezellma, jbernauer, lyeager |
| Version: | 21.08.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13545 | ||
| Site: | NVIDIA (PSLA) | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 24.05.0rc1 |
| Target Release: | 24.05 | DevPrio: | 1 - Paid |
|
Description
Felix Abecassis
2021-08-16 16:05:34 MDT
Felix,
The implementation of the Flags=rebootless setting in helpers.conf was merged to our master branch (7914bb0476). You can find the docs in ./doc/man/man5/helpers.conf.5 on master.
Simply, you need to define a feature like:
>#cat helpers.conf
>NodeName=node1 Feature=a1,a2 Helper=/home/cinek/slurm-confs/Bug12288/h.sh Flags=RebootLess
>                                                                          ^^^^^^^^^^^^^^^^
which will prevent the actual node reboot on a feature change.
Could you please give it a try and share your feedback?
cheers,
Marcin
Hello,
Functionally, it seems to work fine, thanks!
But there is a minor cosmetic issue: I still see the following slurmd log:
>slurmd: Node reboot request with features mig=off being processed
But no reboot is performed.

Hi Felix,
We've pushed updates to the log messages around this to make them less confusing, commit 624fea12c8 [1].
cheers,
Marcin
[1] https://github.com/SchedMD/slurm/commit/624fea12c8c590705462c4247f6ff97457933936

I found another cosmetic issue while testing the feature. It's not really related to your changes, but I thought I would mention it here; let me know if you want a new bug instead.
I see the following in the kernel log on starting slurmd:
```
Mar 27 10:45:05 ioctl kernel: process 'slurmd' launched '/etc/slurm/node_features_helpers/nvidia_mig_off' with NULL argv: empty string added
```
It's caused by this Linux kernel patch: https://github.com/torvalds/linux/commit/dcd46d897adb70d63e025f175a00a89797d31a43
This seems to come from calling run_command without "script_argv" set in _feature_get_state().

Felix - I've split this to Bug 19449, adding you as CC there.

Felix,
I'll go ahead and mark the ticket fixed. The remaining issue is tracked separately, since it's not only related to the new development but also to the way Slurm executes external applications/scripts.
Should you have any questions, please reopen.
cheers,
Marcin
|
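
For reference, the kernel message quoted above is logged when a program is exec'ed with an empty argument vector. A minimal sketch of the usual way to avoid it is to always pass the program path as argv[0], even when the script takes no arguments. The function name run_with_argv0 is hypothetical and used only for illustration; this is not Slurm's run_command() and not the change tracked in Bug 19449.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical illustration, NOT Slurm's run_command().
 * Runs `path` and waits for it to finish. `extra_args` is an optional
 * NULL-terminated list of arguments and may itself be NULL.
 * Returns the child's exit status, 127 if the exec failed, or -1 on
 * allocation/fork/wait failure or abnormal termination.
 */
int run_with_argv0(const char *path, char *const extra_args[])
{
    size_t n = 0;

    assert(path != NULL);
    while (extra_args && extra_args[n])
        n++;

    /* Room for argv[0], the n extra arguments and the terminating NULL. */
    char **argv = calloc(n + 2, sizeof(*argv));
    if (!argv)
        return -1;

    /* By convention argv[0] is the program path, so argv is never empty. */
    argv[0] = (char *) path;
    for (size_t i = 0; i < n; i++)
        argv[i + 1] = extra_args[i];

    pid_t pid = fork();
    if (pid < 0) {
        free(argv);
        return -1;
    }
    if (pid == 0) {
        /* Child: exec with a populated argv instead of a NULL one. */
        execv(path, argv);
        _exit(127); /* only reached if execv() failed */
    }

    /* Parent: the child has its own copy of argv, so free ours. */
    free(argv);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with extra_args set to NULL corresponds to the "no arguments" case: the child still receives a one-element argv holding the script path, so the kernel has nothing to patch up.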