As mentioned in item g) from the original design: https://bugs.schedmd.com/show_bug.cgi?id=9567
Some feature changes do not require a reboot, and a reboot can take a while if there is a lot of firmware / drivers to initialize. Changing a GPU setting (like MIG) often requires just a driver unload/reload or a GPU reset. Ideally, node_features/helpers should allow *some* features to be marked as not requiring a reboot. The usual node drain would still happen, to make sure no other jobs are running on the node when the helper script makes the change (unloading a driver is disruptive), but there should be no reboot afterwards. Note: having a fake reboot script that just restarts slurmd does not work, as slurmctld expects the node to report a node boot (e.g. like "slurmd -b" does for debug purposes). Obviously, other features, like those impacting the BIOS, will always require a reboot.
Felix,
The implementation of the Flags=rebootless setting in helpers.conf was merged to our master branch (7914bb0476). You can find the docs in ./doc/man/man5/helpers.conf.5 on master. Simply, you need to define a feature like:
>#cat helpers.conf
>NodeName=node1 Feature=a1,a2 Helper=/home/cinek/slurm-confs/Bug12288/h.sh Flags=RebootLess
>                                                                          ^^^^^^^^^^^^^^^^
which prevents the actual node reboot on feature change. Could you please give it a try and share your feedback?
cheers, Marcin
Hello,
Functionally, it seems to work fine, thanks! There is one minor cosmetic issue: I still see the following slurmd log message, even though no reboot is done:
slurmd: Node reboot request with features mig=off being processed
Hi Felix,
We've pushed updates to the log messages around this to make them less confusing, commit: 624fea12c8[1].
cheers, Marcin
[1]https://github.com/SchedMD/slurm/commit/624fea12c8c590705462c4247f6ff97457933936
I found another cosmetic issue while testing the feature. It's not really related to your changes, but I thought I would mention it here; let me know if you want a new bug instead. I see the following in the kernel log on starting slurmd:
```
Mar 27 10:45:05 ioctl kernel: process 'slurmd' launched '/etc/slurm/node_features_helpers/nvidia_mig_off' with NULL argv: empty string added
```
It's caused by this Linux kernel patch: https://github.com/torvalds/linux/commit/dcd46d897adb70d63e025f175a00a89797d31a43. The warning seems to come from calling run_command without "script_argv" set in _feature_get_state().
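For illustration only, here is a minimal C sketch of the general remedy for this kind of warning: make sure argv[0] is always populated before calling exec. This is not Slurm's code and does not use its run_command API; the names effective_argv and run_helper are hypothetical, and the fallback of using the script path as argv[0] is an assumption based on the usual convention.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: return an argv that is safe to hand to execv().
 * If the caller supplied no arguments (NULL or an empty vector), fall
 * back to { path, NULL }, stored in the caller-provided `fallback`.
 */
char *const *effective_argv(const char *path, char *const script_argv[],
                            char *fallback[2])
{
    if (script_argv && script_argv[0])
        return script_argv;
    fallback[0] = (char *)path; /* argv[0] is conventionally the program name */
    fallback[1] = NULL;
    return fallback;
}

/*
 * Hypothetical helper: fork, exec `path` with a non-empty argv, and
 * return the child's exit code, or -1 on error or abnormal termination.
 */
int run_helper(const char *path, char *const script_argv[])
{
    char *fallback[2];
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(path, effective_argv(path, script_argv, fallback));
        _exit(127); /* reached only if execv() failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, a call such as run_helper(path, NULL) passes { path, NULL } to execv() rather than a NULL argv, so the kernel should have nothing to patch up.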
Felix - I've split this off into Bug 19449 and added you on CC there.
Felix,
I'll go ahead and mark the ticket fixed. The remaining issue is tracked separately, since it's related not only to the new development but also to the way Slurm executes external applications/scripts. Should you have any questions, please reopen.
cheers, Marcin