Ticket 13309

Summary: ActiveFeatures job submission
Product: Slurm
Component: Scheduling
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: LRZ
Reporter: AB <Alexander.Block>
Assignee: Brian Christiansen <brian>
QA Contact: Brian Christiansen <brian>
CC: brian, jbooth, tim
Status: RESOLVED WONTFIX
Severity: 5 - Enhancement
Priority: ---
Target Release: ---

Description AB 2022-02-02 01:52:19 MST
Hi,

I have posted this already in the slurm-users mailing list but Jess Arrington advised me to file a bug. So here is my problem:

We have configured 4 nodes with certain features, e.g.

"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=96 CPULoad=44.98
   AvailableFeatures=work,scratch
   ActiveFeatures=work,scratch

..."

The features represent mounted file systems. We are now going to take one file system (work) offline for maintenance, so we wanted to remove the corresponding feature from the nodes. We tried, e.g.,

# scontrol update node=thin1 ActiveFeatures="scratch"

resulting in

"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=96 CPULoad=44.98
   AvailableFeatures=work,scratch
   ActiveFeatures=scratch

..."

The problem now is that no jobs can be SUBMITTED requesting the feature work, the error we get is

"sbatch: error: Batch job submission failed: Requested node configuration is not available"


Does this make sense? We want our users to be able to submit jobs requesting features that are generally available, because maintenance usually doesn't last long, and since we have rather long queuing times, users want to submit jobs now for the time when the feature is available again. I understand that jobs might be rejected when a feature does not exist at all, but why when it is merely not active?! Furthermore, 4-node jobs also get rejected at submission when the feature is active on only 3 nodes. Is this a bug? Wouldn't it make more sense for the job to just sit in the queue, waiting for the features/resources to be activated again?

Thanks,
Alexander
Comment 7 AB 2022-02-08 03:39:12 MST
Hi,

since I have not heard anything since I opened the ticket one week ago, I just want to ask what the status of this ticket is.

Thanks,
Alexander
Comment 8 Carlos Tripiana Montes 2022-02-08 06:29:15 MST
That is my fault Alexander,

My apologies.

We've been internally discussing your case and I'm now testing a couple of things before sending you an answer.

Again, sorry for the lack of response,
Carlos.
Comment 9 Carlos Tripiana Montes 2022-02-08 07:54:42 MST
Alexander,

What you describe in the Description is the natural way to implement hot-swapping of file systems, but unfortunately it is not the right way to go.

We could accept making the job wait, rather than refusing it, when a requested feature is not available or active. BUT if we allowed the job to be submitted, the NodeFeatures plugin would actively try to re-activate the disabled feature on the nodes.

It does this by calling the RebootProgram and letting that script handle the reconfiguration of the node so the feature is active after reboot. This comes from the nature of the plugin, which is tightly tied to KNL and the handling of its different compute layers.

So if we allowed it now, you would end up with your nodes constantly rebooting.

The good news is that there is another approach that will work: licenses. Define a pool of local licenses, say 1,000,000, in your slurm.conf. Then either have the users request one of the licenses, or have a job_submit script apply the license to every job that is submitted.

If you want to keep users away from that FS, set this [1] in slurm.conf. Then deactivate access by creating a reservation, as explained in [2]:

scontrol create reservation starttime=[...] \
   duration=[...] user=root flags=license_only \
   licenses=[NAME]:[TOTAL_NUMBER]

You can also use the remote licenses mode explained in [3], but I think local licenses are simpler for this purpose. Remote licenses can also be set to zero with sacctmgr, as explained in [3].
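As a rough sketch of the local-license approach (the license names and counts here are illustrative, not taken from this ticket), the slurm.conf side could look like:

```
# slurm.conf (illustrative): one local license per file system,
# with a pool large enough that it never constrains normal scheduling.
Licenses=work:1000000,scratch:1000000
```

Jobs would then request the license explicitly, e.g. `sbatch -L work:1 job.sh`, or a job_submit plugin could add it automatically; during maintenance, a `flags=license_only` reservation holding the whole pool (as in the scontrol command above) makes those jobs pend instead of starting.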

Please, tell us if this is feasible for you.

Cheers,
Carlos.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_allow_zero_lic
[2] https://slurm.schedmd.com/reservations.html#creation
[3] https://slurm.schedmd.com/licenses.html
Comment 10 AB 2022-02-08 08:54:37 MST
Hi Carlos,

thanks for your answer.

I understand what you are saying. However, the configuration of features for nodes then does not really make sense to me. The slurm.conf documentation states: "Features are intended to be used to filter nodes eligible to run jobs via the --constraint argument." I can also think of the case where only a subset of the nodes lacks a certain feature for a limited time; submission of jobs is then also suppressed (e.g. jobs requesting all nodes).

In our present case we could of course use the license workaround. But that only works properly if ALL nodes are affected or none. To my knowledge there is no node-specific license availability?!

Best regards,
Alexander
Comment 11 Carlos Tripiana Montes 2022-02-08 09:28:56 MST
Hi Alexander,

> In the documentation of slurm.conf is stated "Features are intended to be used to filter nodes eligible to run jobs via the --constraint argument."

If you only use static features (shown as active == available) and never modify the active/available sets on the fly, this statement is still valid. Further information is provided in [1] and [2].

> I can also think of the case that only a subset of the nodes will not have a certain feature for a limited time then submission of jobs is also suppressed (e.g. jobs requesting all nodes).

If you set licenses=0, no job that requests that license will be allowed to run, but it will be allowed to pend in the queue until licenses are available.

You can write a job_submit.lua so that if the user sets constraint A, the script removes it from the constraints and instead requests X licenses of type A (e.g. one per requested node). That way only the jobs that want this file system are prevented from running.
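A minimal sketch of such a job_submit.lua might look as follows. This is illustrative, not from the ticket: it assumes a license named `work` is defined in slurm.conf, only handles a plain `work` constraint (not compound expressions), and whether assigning nil actually clears the features field should be verified against your Slurm version.

```lua
-- job_submit.lua sketch (simplified): translate a plain "work"
-- feature constraint into a "work" license request, so the job
-- pends on licenses instead of being rejected on inactive features.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.features == "work" then
        -- drop the feature constraint...
        job_desc.features = nil
        -- ...and request one "work" license per requested node
        local n = job_desc.min_nodes or 1
        job_desc.licenses = "work:" .. n
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```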

> In our recent case we could of course use the license workaround. But this only works proper if ALL nodes are available or not.

I think the approach I've just proposed should work around this, if I'm understanding your concerns correctly.

> There is no node specific license availability to my knowledge?!

Ah, I'm afraid not. Much in the same way, there is no option today to tell the NodeFeatures logic "please do not reboot the node; this feature is plug-and-play and doesn't require a reboot".

I don't think that last bit is a bad idea, though. It is something that might be future work, open for sponsorship; that could be discussed further with us if there is real interest in it.

Cheers,
Carlos.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_NodeFeaturesPlugins
[2] https://slurm.schedmd.com/intel_knl.html
Comment 12 AB 2022-02-09 03:23:44 MST
Hi Carlos,

what is not clear to me is the inconsistent behavior in Slurm.
Let's say a user submits, on a 4-node cluster with certain features on all nodes, a job requesting all 4 nodes with all features. The job is submitted; the cluster is full, so the job pends. Then a node goes down. The user can still submit the same job; it pends in the queue because the requirements are not fulfilled, but it would start as soon as the node comes back. Now suppose the node comes back missing something (say a file mount is still pending). The admins are still working on the node and disable the feature in the meantime, but bring the node back into operation. Now the user can't even submit his job again, and he has no way to be informed when the feature/configuration is available again.
I think it is fine that hardware parameters like CPU and memory are checked strictly at submission, but for the freely configurable features this is a bit too strict in my eyes.

In any case, before you close the ticket, may I ask you to point me to the part of the code where the rejection is implemented? I may have a look and check whether I can patch it for our needs myself.

Thanks,
Alexander
Comment 13 Carlos Tripiana Montes 2022-02-09 05:24:32 MST
Hi Alexander,

> The user can still submit the same job, the job is pending in the
> queue because the requirements are not fulfilled.

You've spotted the difference. In that case the job waits until there are resources to run, because the configured resources do meet the constrained features; the job just waits until it can run.

On the other hand, if at submission time Slurm knows the constraints can never be met, it refuses to accept the job.

Yes, it's counterintuitive. It is tightly tied to the nature of NodeFeaturesPlugins, which uses Active/Available features to know whether it needs to reboot a node on the fly to meet a job's constraints. So if you aren't using the NodeFeaturesPlugins parameter, Slurm assumes those features are static: if a feature is not available now, it never will be.

As I said, we are not refusing your proposal to change this and allow the job to wait. But as of now it is not implemented, and I have no roadmap for it. In any case, you would still get the call to RebootProgram, which you probably don't want. To avoid the reboot, we would need a feature extension adding a parameter that instructs Slurm whether or not to reboot the job's nodes. That is even more complex, hence the sponsorship suggestion.

Nevertheless, there is another option: set NodeFeaturesPlugins=node_features/helpers [1][2]. This still reboots the nodes, but it allows jobs to stay in the queue rather than be rejected. As I said, this is probably not the way you want to go, unless you don't care about RebootProgram or you can fake it with /bin/true or similar.
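For reference, a minimal sketch of that configuration (node names, script path, and feature list here are illustrative; see [1][2] for the real syntax):

```
# slurm.conf (illustrative)
NodeFeaturesPlugins=node_features/helpers
RebootProgram=/bin/true   # "fake" reboot, per the suggestion above

# helpers.conf (illustrative): a helper script reports/sets the
# changeable features for the listed nodes
NodeName=thin[1-4] Feature=work,scratch Helper=/usr/local/sbin/fs_features.sh
```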

> In any case what I may ask you before you can close the ticket is if you
> maybe can point me to the part in the code where the rejection is
> implemented. I may have a look and check if I can patch it for our needs
> myself.

That is a difficult commitment for me. To answer your question, I would need to do the same preliminary study I'd need to do if we were going to patch this now. And at the end of the day, you would still be forced to reboot the node. Additionally, the feature extension adding a parameter that instructs Slurm whether or not to reboot the nodes is even more complex; it is not a simple patch.

All in all, this effort would be spent on an unsupported issue: as you may be aware, we do not give support for problems arising from custom patching. So devoting time to this is not a good idea; it would be better to work around it with NodeFeaturesPlugins=node_features/helpers instead.

That said, I still think the licenses approach would be less error-prone for now. I hope you understand the situation.

Cheers,
Carlos.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_NodeFeaturesPlugins
[2] https://slurm.schedmd.com/helpers.conf.html
Comment 14 AB 2022-02-09 06:01:56 MST
Hi Carlos,

maybe there was a misunderstanding.
My idea is the following: I can specify any feature on a node I like, and I can take it away any time I like (without rebooting, but with a reconfigure/restart of course).
Jobs that were submitted while the feature was configured stay in the queue; running jobs keep running (I tested this before), but pending jobs just will not start. BUT nothing will try to reboot nodes, right? New jobs requesting the feature cannot be submitted; that is the point where I want to change things and allow jobs to request features that are not there. There is no reboot; they simply go to the queue. They will not run, of course, but all the other queued jobs still requesting the removed feature will also not run, and will stay in the queue forever or until some administrator cancels them. Would this make sense to you?
I understand that this would not be supported by SchedMD.
And I have no idea whether I can make it work. But if I can, we would have a solution for our problem.

Thanks again,
Alexander
Comment 15 Carlos Tripiana Montes 2022-02-09 07:41:29 MST
I see...

> Jobs that have been submitted while the feature was configured stay in the
> queue, running jobs keep running (I tested this before) but pending jobs
> just will not start.

Indeed.

> BUT nothing will try to reboot nodes, right?

Because you don't have NodeFeaturesPlugins set.

> New jobs
> requesting the feature will not be submitted - that's the point where I want
> to allow jobs requesting features that are not there.

If at submission time Slurm knows the constraints can never be met, it refuses to accept the job when you don't have NodeFeaturesPlugins set. Obviously, it doesn't cancel jobs that are already queued, as you pointed out; but that interaction/behaviour is a side effect. The purpose of Available/Active features is, as I said, to tell NodeFeaturesPlugins which configuration needs to be set when rebooting the nodes.

If you only use static features, you shouldn't be manipulating ActiveFeatures; it is not meant for what you are trying to accomplish.

> There is no reboot,

Because you don't have NodeFeaturesPlugins set.

> just simply let them go to the queue.

As I said: we are not refusing your proposal to change this and allow the job to wait. But as of now it is not implemented, and I have no roadmap for it. AND: if we allowed the logic to accept the job, it would reboot your node in the event that you have a NodeFeaturesPlugins in use, or if you enable one later.

> They will not run of course but all
> others in the queue still requesting the feature taken away will also not
> run and stay in the queue forever or till some administrator cancels them.

Or until you add back the feature that is Available but not Active.

> Would this make sense to you?

Yes, it makes sense. I have understood your concerns from the beginning, but I am trying to offer you alternatives, since this is not a bug but intended behaviour; and even though I agree we can modify it, I cannot tell you when it would be implemented, or even whether we will finally do so (all feature changes need to be peer reviewed and accepted).

Anyway, give me a while to discuss this internally again. We still think the correct way to go is not only accepting the job regardless of whether any NodeFeaturesPlugins is set, but also providing a way to tell Slurm whether or not to reboot the node depending on the feature: something like adding a new parameter to node_features/helpers that determines whether the node needs a reboot.

Cheers,
Carlos.
Comment 32 Carlos Tripiana Montes 2022-03-14 04:50:31 MDT
Hi Alexander,

This update is to let you know I'm now in the middle of the patch review to add support for not refusing jobs whose requested features are "inactive".

Such jobs will be allowed to be submitted, but will pend with BadConstraints.

An "inactive" feature is one that is available but not active.

The case of a job requesting an inactive feature is when the job's constraint string, evaluated as a whole, yields "requirements not met", but activating the available features on some nodes would satisfy the requirements.

I hope the review process doesn't take too long, but the whole logic is a bit tricky and needs special care to avoid bugs.

Have a good day,
Carlos.
Comment 33 AB 2022-03-15 02:17:45 MDT
Hi Carlos,

thanks for the feedback. That's good news.
I also had a look at the code in the meantime and decided not to dig deeper - as you said, things are a bit tricky and for me it was too much effort.
But good to know that there will be a solution from your side.

Thanks,
Alexander
Comment 38 Carlos Tripiana Montes 2022-04-18 04:37:25 MDT
Alexander,

Still working on this improvement. It's not stalled.

Regards,
Carlos.
Comment 39 AB 2022-04-21 07:58:39 MDT
Hi Carlos,

thanks for the update.

Regards,
Alexander
Comment 42 Carlos Tripiana Montes 2022-05-12 01:23:40 MDT
Hi Alexander,

We're still working on this issue. I suspect it will not be quick to get it finished and pushed.

If you see intervals of silence like the last one, please be understanding: we are facing some complexities and need to address quite a lot of other work as well. It doesn't mean we have set your issue aside.

Thanks for your understanding.

Regards,
Carlos.
Comment 44 Jason Booth 2022-05-18 13:44:09 MDT
Just some minor bookkeeping here. We are converting this to an enhancement and passing along an update for this issue. We are still looking into this since it overlaps with some other work planned for our next release. As such, this will not be part of 22.05.
Comment 45 AB 2022-05-18 13:48:43 MDT
Hi,

thanks for your updates.

If you wish you may also close this ticket. I will see the enhancement once you have implemented this into a version.

Best,
Alexander
Comment 46 Carlos Tripiana Montes 2022-05-30 01:35:26 MDT
Hi Alexander,

> If you wish you may also close this ticket.

We're not going to close it, so that we can better track the enhancement.

Our work here overlaps with some other development, but it is not yet clear whether that development will fully supersede what we have discussed here.

As a result, it is best to keep this open as an enhancement. I'll let you know of any related news, but in the meantime I'm marking this bug as stalled.

Regards,
Carlos.
Comment 51 Jason Booth 2023-11-09 14:13:20 MST
I wanted to pass along an update regarding this issue. Although we thought this might overlap with other development, this issue diverged in a different direction. Carlos's initial work on this looked promising, but after reviewing it and the work involved, it is not something we plan to address anytime in the near future.

This resolution is unfortunate given our initial comments on patching this; however, we have to consider where we spend our development time. I do apologize for any inconvenience this might cause.