| Summary: | When using select/cons_res, exclusive jobs should get the whole memory (and not more) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Thomas Opfer <hrz> |
| Component: | Contributions | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | dameyer, hrz, jacob |
| Version: | 17.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5562 | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Lichtenberg High Performance Computer | CLE Version: | |
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Ticket Depends on: | 3926 | ||
| Ticket Blocks: | |||
| Attachments: | Incomplete fix. Fix that Danny proposed. Fix that Danny proposed (in a better format). | ||
Description (Thomas Opfer, 2017-06-07 16:33:06 MDT)
Hey Thomas, I can see the benefit of what you are looking for.
I would lean towards a SelectTypeParameter, perhaps settable at the partition level as well, that turns on this behavior. Instead of what you have done, perhaps try this:
diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 8e61f8d96f..7c7466d659 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -3759,7 +3759,8 @@ alloc_job:
return error_code;
/* load memory allocated array */
- save_mem = details_ptr->pn_min_memory;
+ if (job_ptr->details->whole_node != 1)
+ save_mem = details_ptr->pn_min_memory;
if (save_mem & MEM_PER_CPU) {
/* memory is per-cpu */
save_mem &= (~MEM_PER_CPU);
The patch does roughly what you are looking for, in fewer lines. The kicker is when there is already a default memory limit: at this point there is no way to know where the memory setting came from.
In src/slurmctld/job_mgr.c, the following lines
if (job_desc_msg->pn_min_memory == NO_VAL64) {
/* Default memory limit is DefMemPerCPU (if set) or no limit */
if (part_ptr && part_ptr->def_mem_per_cpu) {
job_desc_msg->pn_min_memory =
part_ptr->def_mem_per_cpu;
} else {
job_desc_msg->pn_min_memory =
slurmctld_conf.def_mem_per_cpu;
}
set it (probably erroneously, to the first part_ptr in the list of partitions the job was submitted to). Perhaps this default should be applied much later in the job's life, and then we would know for real what the request should be.
Perhaps select_g_job_test should also be changed to take min_mem as an argument, as we already do with min_nodes, or something along those lines.
At the moment I don't have a good answer for how to get what you are looking for, though. Let us know if you come up with something.
Hi Danny, thanks for your answer. I'm not sure whether we are misunderstanding each other. First of all, let me say that we require our users to provide "--mem-per-cpu"; jobs that do not provide it are rejected by a submission plugin. The error that we see was reported in bug 3847. Here is a brief summary: Slurm allocates too much memory on a node if a job runs in exclusive mode but only requests a few tasks. It seems that the number of cores in the node is multiplied by the requested memory per core. For non-exclusive jobs this is perfectly fine, but for exclusive jobs it can lead to too much allocated memory. Imagine a node with 32 GB of memory and 16 cores. If I now request 1 core (-n 1) exclusively (--exclusive) with --mem-per-cpu=20000, then Slurm allocates 16*20 GB on this node, while the node only has 32 GB. Slurm then complains that the node is overallocated, and the cgroup does not limit anything either. I thought my fix (or your proposed modification) would resolve this problem, but allocations created by srun do not work in this case. Do you have any idea why this happens? Best regards, Thomas

I'm not sure, but I am not getting the error you speak of. On a node with 15737M of memory and 8 CPUs:

salloc --exclusive
srun --mem-per-cpu=2000 -n1 -c8 hostname
srun: error: Unable to create job step: Memory required by task is not available

Let us know if you figure it out. From what I can see, bug 3847 is the same bug as this one.

Hi Danny, I did not use salloc to create the allocation but let srun create it. (This seems to be important!) When I apply my fix, I get the above error. When I do not apply my fix, the node is overallocated (according to Slurm). Let me also mention again that this problem has nothing to do with requesting too many resources, as in your example. It occurs when I run something like

srun -n 1 --exclusive --mem-per-cpu=10000 [...]

not from within an salloc allocation, but directly on a login node. In this example, Slurm (without my fix) allocates 16 CPUs (as the node has 16 CPU cores) and 16*10000 MB of memory (which the node does not have). In my opinion, this should not allocate more than the node has: in our example not 160 GB, but only 32 GB (which would be possible). Best regards, Thomas

I just did another test, and the allocation is also somehow wrong for salloc:

to86cola@hla0002:~$ /opt/slurm/current/bin/salloc -n 1 --exclusive --mem-per-cpu=25000 -t 30 -C mpi -w hpa0001
salloc: Pending job allocation 3600377
salloc: job 3600377 queued and waiting for resources
salloc: job 3600377 has been allocated resources
salloc: Granted job allocation 3600377
salloc: Waiting for resource configuration
salloc: Nodes hpa0001 are ready for job
to86cola@hla0002:~$ scontrol show node hpa0001 | grep TRES
   CfgTRES=cpu=16,mem=28000M
   AllocTRES=cpu=16,mem=400000M
to86cola@hla0002:~$ exit
salloc: Relinquishing job allocation 3600377

Then srun also accepts far too much memory:

to86cola@hla0002:~$ /opt/slurm/current/bin/salloc -n 1 --exclusive --mem-per-cpu=25000 -t 30 -C mpi -w hpa0001
salloc: Pending job allocation 3600379
salloc: job 3600379 queued and waiting for resources
salloc: job 3600379 has been allocated resources
salloc: Granted job allocation 3600379
salloc: Waiting for resource configuration
salloc: Nodes hpa0001 are ready for job
to86cola@hla0002:~$ srun -n 1 --mem-per-cpu=400000 hostname
hpa0001
to86cola@hla0002:~$ srun -n 1 --mem-per-cpu=400001 hostname
srun: error: Unable to create job step: Memory required by task is not available

I'll try what happens with my patch in this case and come back to you then. Best regards, Thomas

Created attachment 4827 [details]
Fix that Danny proposed.

I attached the fix that Danny proposed. It seems to work correctly, but there is a bug in srun that should be fixed before this is applied; then the above errors disappear. See bug 3926.

Created attachment 4828 [details]
Fix that Danny proposed (in a better format).
(In reply to Thomas Opfer from comment #10)
> Created attachment 4828 [details]
> Fix that Danny proposed (in a better format).

I just wanted to give an update on this. We have been running the patch for a few weeks now (together with the fix for bug 3926, which has since been added to Slurm), and the overallocation errors are gone. It seems to work as intended. Best regards, Thomas

*** Ticket 3847 has been marked as a duplicate of this ticket. ***

I see this is still open. Was it addressed in a later version of Slurm, or was a workaround identified? We are running 18.08.5-2 and I have seen the error a couple of times. Thank you, Doug