Ticket 5787 - Memory change in 17.11.7
Summary: Memory change in 17.11.7
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 17.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Jason Booth
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-09-28 03:48 MDT by GSK-ONYX-SLURM
Modified: 2018-10-05 08:13 MDT

See Also:
Site: GSK
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: ?
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description GSK-ONYX-SLURM 2018-09-28 03:48:51 MDT
Hi.
We are in the process of rolling out 17.11.7 into our production environment.  However, I have seen that the release notes for 17.11.8 say:

-- Revert to previous behavior when requesting memory per cpu/node introduced
    in 17.11.7.

Can I get some more detail on what the memory change was that went into 17.11.7, and why it needed to be backed out?

We've not seen any issues with 17.11.7 in our dev / test environment but I need to understand what the impact is going to be.  We need 17.11.7 to fix a deadlock issue we see in earlier versions.  But I don't want to fix one problem just to walk into another.

Thanks.
Mark.
Comment 1 Jason Booth 2018-09-28 10:05:07 MDT
Greetings,

 This issue spans multiple tickets and is explained in this commit:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

You can also see some additional comments from Bug #4976 comment #31 and comment #32
Comment 2 GSK-ONYX-SLURM 2018-09-28 10:45:59 MDT
Hi.

Yes I understand that memory fixes were applied in 17.11.7.

But the release notes for 17.11.8 imply those fixes were backed out, which implies that one or more of the 4 fixes were bad.

Why were they backed out of 17.11.8 ?  What were the undesirable effects?

Thanks.
Mark.
Comment 3 Alejandro Sanchez 2018-10-01 10:00:09 MDT
Hey Mark,

This patch committed against 17.11.7

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

was intended to address 4 issues as described in the commit description. Briefly:

1. On multi-partition job requests, the job's pn_min_memory was validated against each partition's MaxMemPer* limit, but the logic reused the pn_min_memory value left over from the previous partition in the loop instead of the originally requested value.

2. Memory enforcement varied depending on whether or not the job was submitted against a reservation.

3. (Arguably not an issue but a change in functionality, see bug 5240). When a job's memory request exceeded the MaxMemPer* limit, Slurm was making automatic adjustments under the covers, such as increasing the number of requested CPUs so that the mem/cpu ratio did not exceed the limit.

4. (Arguably not an issue but a change in functionality, see bug 5240). When requesting the special case of --mem[-per-cpu]=0, Slurm was allocating all the memory on the node, as documented, even if that exceeded the MaxMemPer* limit.
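To make point 3 concrete, here is a minimal Python sketch of the automatic adjustment described above. This is only an illustration of the described behavior, not Slurm's actual C code; the function name and MB units are invented for the example:

```python
def adjust_cpus(mem_per_cpu_mb, cpus, max_mem_per_cpu_mb):
    """Illustrative sketch of the pre-17.11.7 behavior from point 3.

    If the requested mem/cpu exceeds the partition's MaxMemPerCPU,
    raise the CPU count so the effective mem/cpu ratio stays within
    the limit, instead of rejecting the job outright.
    """
    if mem_per_cpu_mb > max_mem_per_cpu_mb:
        total_mem = mem_per_cpu_mb * cpus
        # Ceiling division: request enough CPUs to keep mem/cpu <= limit.
        cpus = -(-total_mem // max_mem_per_cpu_mb)
        mem_per_cpu_mb = max_mem_per_cpu_mb
    return mem_per_cpu_mb, cpus

# e.g. a job asking for the equivalent of --mem-per-cpu=4000 with 1 CPU
# on a partition with MaxMemPerCPU=2000 was silently turned into a
# 2-CPU request:
print(adjust_cpus(4000, 1, 2000))  # (2000, 2)
```

The 17.11.7 commit dropped this silent adjustment (jobs over the limit were rejected instead), which is exactly the behavior change users pushed back on in bug 5240.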

So when that commit was applied in 17.11.7, bug 5240 (and a number of duplicates) was filed because people preferred the previous behavior for 3 and 4 (the automatic adjustments and --mem=0 allocating all memory). After discussing this further on that bug, we decided to revert the changes introduced in the commit above in this new commit for 17.11.8

https://github.com/SchedMD/slurm/commit/d52d8f4f0ce1a

and then, on top of the revert, apply a partial version of the initial commit that fixed only issues 1 and 2, leaving the behavior of 3 and 4 untouched:

https://github.com/SchedMD/slurm/commit/f07f53fc138b22

Although my _personal_ viewpoint is that 3 and 4 were fine, it is true that they are a change in functionality, and such changes should have been made in the master branch rather than in a micro-release. By popular consensus, customers preferred the original behavior with the automatic adjustments, so we reverted those two points.

In any case, what I'd suggest is to upgrade to the latest 17.11 version (currently 17.11.9-2 and soon 17.11.9), and if possible to the latest 18.08 branch, which is already released.

Please let us know if you have further questions. Thanks.
Comment 4 Alejandro Sanchez 2018-10-01 10:02:50 MDT
(In reply to Alejandro Sanchez from comment #3)
> In any case, what I'd suggest is to upgrade to the latest 17.11 version
> (currently 17.11.9-2 and soon 17.11.9

... and soon 17.11.10.
Comment 5 Jason Booth 2018-10-04 11:01:10 MDT
Hi Mark - I am going to mark this ticket as "info given", since Alex has covered the details in comment #3.
Comment 6 GSK-ONYX-SLURM 2018-10-05 08:13:47 MDT
Thanks Alex & Jason.

The information supplied tells me what I need to know.  We can continue with our planned upgrade to 17.11.7.  We'll be looking to move to 18.x after that.

Thanks.
Mark.