| Summary: | Memory change in 17.11.7 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | Documentation | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | ? | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Greeting,

This issue spans multiple tickets and is explained in this commit: https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

You can also see some additional comments in Bug #4976, comment #31 and comment #32.

Hi. Yes, I understand that memory fixes were applied in 17.11.7. But the release notes for 17.11.8 imply those fixes were backed out, which by implication means one or more of the 4 fixes were bad. Why were they backed out of 17.11.8? What were the undesirable effects?

Thanks. Mark.

Hey Mark,

This patch committed against 17.11.7, https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e, was intended to address 4 issues as described in the commit message. Briefly:

1. On multi-partition requests, the job's pn_min_memory was validated against each partition's MaxMemPer* limit, but the logic was reusing the pn_min_memory value left over from the previous partition in the loop instead of starting from the job's original request.

2. Memory enforcement varied depending on whether or not the job was issued against a reservation.

3. (Arguably not an issue but a change in functionality; see bug 5240.) When a job's memory request exceeded a MaxMemPer* limit, Slurm was making automatic adjustments under the covers, such as increasing the number of requested CPUs so that the mem/CPU ratio would not exceed the limit.

4. (Arguably not an issue but a change in functionality; see bug 5240.) When requesting the special case of --mem[-per-cpu]=0, Slurm was allocating all the memory on the node as documented, even if that exceeded a MaxMemPer* limit.

So when that commit was applied in 17.11.7, bug 5240 (and a bunch of duplicates) were filed because people liked the previous behavior for 3 and 4 (automatic adjustments and --mem=0 allocating all memory).
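To make issue 1 concrete, here is a minimal sketch of the bug pattern described above: carrying a clamped value across loop iterations instead of re-validating each partition against the original request. The names (`Partition`, `validate_buggy`, `validate_fixed`) and the MiB units are illustrative assumptions, not Slurm's actual internals.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    max_mem_per_cpu: int  # illustrative MaxMemPerCPU limit, in MiB


def clamp_to_limit(pn_min_memory: int, part: Partition) -> int:
    """Apply the partition's memory limit to the request."""
    return min(pn_min_memory, part.max_mem_per_cpu)


def validate_buggy(requested: int, partitions):
    """Buggy pattern: each iteration reuses the value already clamped
    by the previous partition instead of the job's original request."""
    pn_min_memory = requested
    results = {}
    for part in partitions:
        pn_min_memory = clamp_to_limit(pn_min_memory, part)  # carries over!
        results[part.name] = pn_min_memory
    return results


def validate_fixed(requested: int, partitions):
    """Fixed pattern: restart from the original request before
    validating against each partition's limit."""
    results = {}
    for part in partitions:
        results[part.name] = clamp_to_limit(requested, part)
    return results


parts = [Partition("small", max_mem_per_cpu=2048),
         Partition("big", max_mem_per_cpu=8192)]

# A 4096 MiB request: the buggy loop wrongly limits the "big" partition
# to 2048 because the "small" partition was processed first.
print(validate_buggy(4096, parts))  # {'small': 2048, 'big': 2048}
print(validate_fixed(4096, parts))  # {'small': 2048, 'big': 4096}
```

Note the result of the buggy version depends on partition order, which is exactly why only the multi-partition case misbehaved.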
After discussing this further on this bug, we decided to revert the changes introduced in the commit above, in this new commit for 17.11.8: https://github.com/SchedMD/slurm/commit/d52d8f4f0ce1a

Then, on top of the revert, we applied a partial version of the initial commit that fixed only issues 1 and 2, leaving the behavior of 3 and 4 untouched: https://github.com/SchedMD/slurm/commit/f07f53fc138b22

Although my _personal_ viewpoint is that 3 and 4 were fine, it is true that they are a change in functionality, and such changes should have been made in the master branch and not in a micro-release. By popular consensus, it looks like customers preferred the original behavior with the automatic adjustments, so we reverted those 2 points.

In any case, what I'd suggest is to upgrade to the latest 17.11 version (currently 17.11.9-2 and soon 17.11.9), and if possible to the latest 18.08 branch, which is already released.

Please let us know if you have further questions. Thanks.

(In reply to Alejandro Sanchez from comment #3)
> In any case, what I'd suggest is to upgrade to the latest 17.11 version
> (currently 17.11.9-2 and soon 17.11.9

... and soon 17.11.10.

Hi Mark - I am going to mark this as info given since Alex seems to have covered the details in comment #3.

Thanks Alex & Jason. The information supplied tells me what I need to know. We can continue with our planned upgrade to 17.11.7. We'll be looking to move to 18.x after that.

Thanks. Mark.
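The two behaviors that the revert preserved (points 3 and 4) can be sketched as follows. This mirrors the description in the comments above, not Slurm's source; the function name, the MiB units, and the simplification that --mem-per-cpu=0 resolves to the whole node's memory are assumptions for illustration.

```python
import math


def resolve_request(mem_per_cpu: int, cpus: int, max_mem_per_cpu: int,
                    node_memory: int):
    """Return (mem_per_cpu, cpus) after the adjustments described in
    points 3 and 4. All memory values are in MiB."""
    if mem_per_cpu == 0:
        # Point 4: --mem[-per-cpu]=0 allocates all memory on the node,
        # even past the MaxMemPerCPU limit.
        return node_memory, cpus
    if mem_per_cpu > max_mem_per_cpu:
        # Point 3: instead of rejecting the job, grow the CPU count so
        # the per-CPU ratio stays within the limit; the total amount of
        # requested memory is preserved.
        total = mem_per_cpu * cpus
        cpus = math.ceil(total / max_mem_per_cpu)
        mem_per_cpu = max_mem_per_cpu
    return mem_per_cpu, cpus


# A job asking for 8 GiB per CPU on 2 CPUs under a 4 GiB MaxMemPerCPU
# limit ends up with 4 GiB per CPU on 4 CPUs (same 16 GiB total).
print(resolve_request(8192, 2, 4096, 65536))  # (4096, 4)
```

The 17.11.7 commit made such requests fail against the limit instead; the 17.11.8 revert restored this silent adjustment because users depended on it.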
Hi. We are in the process of rolling out 17.11.7 into our production environment. However, I have seen that the release notes for 17.11.8 say:

-- Revert to previous behavior when requesting memory per cpu/node introduced in 17.11.7.

Can I get some more detail on what the memory change was that went into 17.11.7, and why it needed to be backed out? We've not seen any issues with 17.11.7 in our dev/test environment, but I need to understand what the impact is going to be. We need 17.11.7 to fix a deadlock issue we see in earlier versions. But I don't want to fix one problem just to walk into another.

Thanks. Mark.