Ticket 3011

Summary:	Change of behaviour - can no longer submit to a down partition in 16.05.4
Product:	Slurm	Reporter:	Chris Samuel <samuel>
Component:	slurmctld	Assignee:	Alejandro Sanchez <alex>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	alex
Version:	16.05.4
Hardware:	Linux
OS:	Linux
Site:	VLSCI	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	16.05.5
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Chris Samuel 2016-08-21 22:10:58 MDT

Hi folks,

I just upgraded our test cluster from 16.05.3 to 16.05.4 before a planned outage next week and I find that I can no longer submit to partitions that are marked as down, whereas I'm 99% certain that this worked in 16.05.3 as I did that this morning.

Now I get:

[samuel@bruce MPI]$ sbatch latency.sh
sbatch: error: Batch job submission failed: Requested partition configuration not available now

This isn't good as marking partitions down but allowing jobs to be submitted is something we do from time to time as it's the nearest thing Slurm had to Moab's scheduler "pause" feature (i.e. jobs can still come in but won't be scheduled).

Whilst there are some changes relating to partitions mentioned in the NEWS file they all seem to be bug fixes and not a behaviour change like this.

Is this an unintended consequence?

We have:

EnforcePartLimits=YES

which has been unchanged since 2013 according to git.

All the best,
Chris

Comment 1 Alejandro Sanchez 2016-08-22 03:14:07 MDT

Hi Chris. In 16.05.0pre1 EnforcePartLimits behaviour was modified so that it not only accepts YES|NO values but instead NO|[YES|ANY]|ALL. You can see the change in the 16.05.0pre1 NEWS file entry:

Enhance slurm.conf option EnforcePartLimit to include options like "ANY" and
"ALL". "Any" is equivalent to "Yes" and "All" will check all partitions
a job is submitted to and if any partition limit is violated the job will
be rejected even if it could possibly run on another partition.

Now, bug #2920 made us notice that depending on the order of a multipartition submission (e.g. --partition=part1,part2,partN) the EnforcePartLimits was only considered for the last partition (partN) instead that taking into account all the submitted partitions _and_ the actual value of EnforcePartLimits itself. So it was not worked as it was designed for 16.05. This was fixed in 16.05.4 in commits:

https://github.com/SchedMD/slurm/commit/3bc80da7620cc2c3cfa04d7ea7b9b3b5db5a72a7
https://github.com/SchedMD/slurm/commit/3bc80da7620cc2c3cfa04d7ea7b9b3b5db5a72a7
https://github.com/SchedMD/slurm/commit/3bc80da7620cc2c3cfa04d7ea7b9b3b5db5a72a7
https://github.com/SchedMD/slurm/commit/30baec8d454d8ce79b1350c510f935a003c4a719

So this is not a behaviour change from 16.05.3 to 16.05.4, the behaviour change was made in 16.05.0pre1 (in fact it was an enhancement to EnforcePartLimtits to provide more flexibility), but rather a bug fix since it was not working as expected. We updated the documentation so admins can check which partition limits are considered to be enforced. From the slurm.conf in 16.05.4:

NOTE: The partition limits being considered are it's configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, State (e.g. DOWN or INACTIVE), AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold.

So the partition State matters against the EnforcePartLimits value.

Imagine you had 3 partitions:

part1 in UP state
part2 in UP state
part3 in DOWN state

in 16.05.3 with EnforcePartLimits=YES (which is the same to ANY, which means that at least one partition should satisfy the limits), the submission against --part=part1,part2,part3 was rejected (because only the last one was considered) but a submission against --part=part3,part1,part2 was accepted. And this was fixed in 16.05.4.

At submission time you can check in your slurmctld.log which partition and which limit can make a job to be rejected:

debug2("Job %u requested down partition %s", job_ptr->job_id, part_ptr->name);

if you have EnforcePartLimits=YES it means that at least one of the job partitions must be UP. Can you please check which partitions is your job submitted against and in which states are these partitions when you submit the job?

Thanks.

Comment 2 Chris Samuel 2016-08-22 18:20:16 MDT

Hi Alejandro,

I'm a bit puzzled because you say that previously submission to a DOWN partition would fail if it was the last in the list, but we have only ever advertised and used submission to a single partition and so I would have thought that would mean that submitting to a single DOWN partition would have failed, but that has never been our experience until 16.05.4.

I appreciate how you feel this isn't a change in behaviour, but as an external observer the fact that we can no longer submit to a DOWN partition in 16.05.4 is pretty much definitively that.

How can I exclude the consideration of the state of a partition from causing a job to be rejected, or removed from a list of partitions, by Slurm please?

EnforcePartLimits=NO doesn't seem to be an option as I still want jobs rejected if they are submitted to a partition they don't have permission to use.

If we have a situation in future where we have users submitting to multiple partitions (say one for all users and one covering a set of nodes owned by a particular group) I really cannot afford for jobs submitted to both to have the private partition dropped from newly submitted jobs from their owners if we've set that partition as DOWN for some reason.

There's also the case of pipelines of jobs where jobs submit later stages of the pipeline and if those are rejected due to a partition being DOWN rather than just being queued until the partition is UP then you'll have a lot of very unhappy bioinformaticians (to use one example that I know rather well).

thanks,
Chris

Comment 3 Alejandro Sanchez 2016-08-23 05:43:48 MDT

Chris, let's see if I can clarify the history behind this. When first 16.05 version was released, as I said in my previous comment, the EnforcePartLimits was changed to accept 3 values: NO|[ANY|YES]|ALL. So the intented behaviour when first 16.05 version was released was to check the job request against the partition(s) limits and the EnforcePartLimit value and accept/reject the submission upon that.

So for instance if a job was requested against one single partition and the job didn't satisfy the partition limits and enforce was different to NO, then the design was to reject this job.

I can understand that you are puzzled and I'm gonna explain you why. The check against the partition limits is made in different spots. One of the spots, where the job accessibility to the partition is made (e.g. DenyAccounts value vs job account) was properly coded to handle the EnforcePartLimits value. So no matter the 16.05 version if you send a job to a single partition with DenyAccounts=x and job account=x and EnforcePartLimtis != NO, the job is rejected.

But the partitions have other limits, such as MaxNodes, State, MaxMemPerNode, etc. that were also designed to be sensitive to the EnfocePartLimits value. The problem is that when first 16.05 version was released, there was a bug because job request was not sensitive to EnforcePartLimits in the spot where this other checks are handled. And this was detected in bug #2920 and fixed in 16.05.4. So I can understand that to customer's eyes it might seem like a design/behaviour change, but the fact is that the design was that the partition limits should be sensitive to EnforcePartLimits and job request. This was simply a bug.

So now, known the history, I can internally discuss with the team if the specific partition check for the partition State should be sensitive or not to EnforcePartLimits value. The ideal thing would be to provide more flexibility to the user and let the administrator choose which specific partition limits do they want to be sensitive to enforce, so they have more granularity. But right now this is not configurable at all.

So let me internally discuss how do we proceed and if partition State should be removed from enforce consideration. Sorry for the confussion caused. We're gonna think what's the best solution for this and come back to you. Thanks.

Comment 6 Chris Samuel 2016-08-23 18:02:11 MDT

Hi Alejandro,

Thanks for that additional explanation, I think that helps makes things clearer for me.  Except that the manual page for scontrol says:

# DOWN      Designates that new jobs may be queued on the partition

I was thinking further about this last night and was pondering the idea of having a DRAIN state for partitions in the same way you can have the DRAIN state for a node. Then I checked the manual page for scontrol this morning and realised there was already a state there for that, but that it says DRAIN rejects jobs, whereas DOWN should accept jobs.

I did try and see if my meagre C skills were enough to work around the new behaviour for another groups cluster I upgraded Slurm on yesterday, but on the test system the best I could achieve was to get the job accepted but the user was still told submission failed.  So in the end I just upgraded to 16.05.3.

Our own clusters are down for a big maintenance window next week (need to upgrade firmware on IB switches, upgrade GPFS, etc, so it's a complete outage) so I suspect we'll need to go to 16.05.3 here too.

If you folks are able to come up with a small patch that will accept jobs if a partition is down I'm willing to carry that locally, we already apply 2 patches of our own via quilt (one to revert a bug fix in 15.x where #SBATCH directives are now ignored after the first non-comment as we have lots of scripts relying on that and another to change the location of the PMI2 socket to /dev/shm as we have a plugin that bind-mounts a job specific directory on our /scratch over /tmp and /var/tmp & thus breaks PMI2).

All the best,
Chris

Comment 7 Alejandro Sanchez 2016-08-24 02:51:45 MDT

(In reply to Chris Samuel from comment #6)
> Hi Alejandro,
> 
> Thanks for that additional explanation, I think that helps makes things
> clearer for me.  Except that the manual page for scontrol says:
> 
> # DOWN      Designates that new jobs may be queued on the partition

Yes, we also realized that was not coherent with 16.05.4 behaviour.

> I was thinking further about this last night and was pondering the idea of
> having a DRAIN state for partitions in the same way you can have the DRAIN
> state for a node. Then I checked the manual page for scontrol this morning
> and realised there was already a state there for that, but that it says
> DRAIN rejects jobs, whereas DOWN should accept jobs.
> 
> I did try and see if my meagre C skills were enough to work around the new
> behaviour for another groups cluster I upgraded Slurm on yesterday, but on
> the test system the best I could achieve was to get the job accepted but the
> user was still told submission failed.  So in the end I just upgraded to
> 16.05.3.
> 
> Our own clusters are down for a big maintenance window next week (need to
> upgrade firmware on IB switches, upgrade GPFS, etc, so it's a complete
> outage) so I suspect we'll need to go to 16.05.3 here too.
> 
> If you folks are able to come up with a small patch that will accept jobs if
> a partition is down I'm willing to carry that locally, we already apply 2
> patches of our own via quilt (one to revert a bug fix in 15.x where #SBATCH
> directives are now ignored after the first non-comment as we have lots of
> scripts relying on that and another to change the location of the PMI2
> socket to /dev/shm as we have a plugin that bind-mounts a job specific
> directory on our /scratch over /tmp and /var/tmp & thus breaks PMI2).
> 
> All the best,
> Chris

We're gonna create a patch that you will be able to locally apply or wait for 16.05.5.

Comment 11 Chris Samuel 2016-08-24 18:16:34 MDT

Hi Alejandro,

Thanks so much for that - happy to see what happens, at the very least we'll go to 16.05.3 next week (needed for best use of Shifter from what Doug says).

All the best,
Chris

Comment 14 Alejandro Sanchez 2016-08-26 04:07:21 MDT

Chris, following commit should allow jobs submitted to DOWN partitions to be accepted:

https://github.com/SchedMD/slurm/commit/76d62ae466e84f0c6

Commit will be available in Slurm 16.05.5, or you can apply the commit patch beforehand by appending the ".patch" prefix to the commit URL:

https://github.com/SchedMD/slurm/commit/76d62ae466e84f0c6.patch

Please, let me know if you encounter any issues, otherwise if you give me the ok I will mark the bug as resolved/fixed. Thanks.

Comment 15 Alejandro Sanchez 2016-08-26 09:18:26 MDT

Chris, if you apply the previous patch locally, please also apply this other one which is an extension of the previous and they should go hand in hand. If not, just wait for next micro version 16.05.5:

https://github.com/SchedMD/slurm/commit/2e4552df788.patch

Again, please let us know how it goes. Thanks.

Comment 16 Chris Samuel 2016-08-28 18:41:13 MDT

Hi Alejandro,

Thanks so much for the patches and sorry about the tardy response, the university Exchange server was sulking and randomly delaying emails to me for days.  Fixed now so these arrived over the weekend.

We'll test out these patches and see how they go.

Do you have any idea of when 16.05.5 might appear?  We've an extended shutdown for filesystem work and so if it's just a few days away we might factor that in.

cheers!
Chris

Comment 17 Moe Jette 2016-08-29 08:38:33 MDT

Hi Chris,

Alejandro is on vacation, so I'll respond. Slurm version 16.05.4 was released on August 14 and is proving relatively stable. Unless that situation changes, I don't expect to release 16.05.5 until mid-September.

Moe


(In reply to Chris Samuel from comment #16)
> Hi Alejandro,
> 
> Thanks so much for the patches and sorry about the tardy response, the
> university Exchange server was sulking and randomly delaying emails to me
> for days.  Fixed now so these arrived over the weekend.
> 
> We'll test out these patches and see how they go.
> 
> Do you have any idea of when 16.05.5 might appear?  We've an extended
> shutdown for filesystem work and so if it's just a few days away we might
> factor that in.
> 
> cheers!
> Chris

Comment 18 Chris Samuel 2016-08-29 17:37:32 MDT

(In reply to Moe Jette from comment #17)

> Hi Chris,

Hiya Moe,

> Alejandro is on vacation, so I'll respond. Slurm version 16.05.4 was
> released on August 14 and is proving relatively stable. Unless that
> situation changes, I don't expect to release 16.05.5 until mid-September.

No worries, that's fine with us, we're all set to roll out 16.05.4 with these patches when we go live once all the rest of our upgrade work is done.

Thanks again for great support!
Chris

Comment 19 Alejandro Sanchez 2016-09-13 07:38:10 MDT

Hi Chris, is it fine to close this bug or you have any more questions?

Comment 20 Chris Samuel 2016-09-13 19:20:22 MDT

(In reply to Alejandro Sanchez from comment #19)

> Hi Chris, is it fine to close this bug or you have any more questions?

Fine to close, thanks!

All the best,
Chris