Ticket 6769 - select_cons_res memory under-allocated
Summary: select_cons_res memory under-allocated
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Duplicates: 7221 7866
Depends on:
Blocks:
 
Reported: 2019-03-27 16:23 MDT by Marshall Garey
Modified: 2021-09-09 02:49 MDT
CC: 7 users

See Also:
Site: SchedMD
Version Fixed: 19.05.3 20.02.0pre1


Attachments
slurm.conf (4.23 KB, text/plain), 2019-03-27 16:23 MDT, Marshall Garey
alex slurm.conf (3.10 KB, text/plain), 2019-04-30 09:47 MDT, Alejandro Sanchez
6769_1905_v1 (4.66 KB, patch), 2019-05-30 06:55 MDT, Alejandro Sanchez
patch version 2 (9.38 KB, patch), 2019-06-10 15:09 MDT, Moe Jette

Description Marshall Garey 2019-03-27 16:23:24 MDT
Created attachment 9716: slurm.conf

I'm getting errors like this occasionally:

[2019-03-27T16:18:32.769] error: select/cons_res: node v4 memory is under-allocated (0-2048) for JobId=61149


I reproduced this with a bunch of job submissions like this:

marshall@voyager:~/slurm/18.08/voyager$ for i in {1..20}; do sbatch --mem=2G -Dtmp -N1 --wrap="srun whereami 1"; done 


marshall@voyager:~/slurm/18.08/voyager$ sacct -j 61149 --format=jobid,alloctres%30,reqtres%30
       JobID                      AllocTRES                        ReqTRES 
------------ ------------------------------ ------------------------------ 
61149         billing=2,cpu=2,mem=2G,node=1  billing=1,cpu=1,mem=2G,node=1 
61149.batch             cpu=2,mem=2G,node=1                                
61149.extern  billing=2,cpu=2,mem=2G,node=1                                
61149.0                 cpu=1,mem=2G,node=1                                


I'm guessing it has something to do with requesting 1 CPU but being allocated 2 CPUs, because of CR_Core_Memory plus hyperthreading. I'm attaching my slurm.conf.
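For what it's worth, here is a tiny standalone C sketch (not Slurm code; the numbers are just taken from my node definition) of the rounding I suspect is happening: with CR_Core_Memory the allocation unit is a whole core, so a 1-CPU request on a ThreadsPerCore=2 node comes back as 2 CPUs, which matches the sacct output above.

#include <stdio.h>

int main(void)
{
	/* Assumed layout: hyperthreaded node with 2 threads per core,
	 * as in the attached slurm.conf. */
	int threads_per_core = 2;
	int requested_cpus = 1;		/* what the job asked for */

	/* With CR_Core_* parameters the allocation granularity is a core,
	 * so the CPU count is rounded up to a multiple of ThreadsPerCore. */
	int allocated_cpus = ((requested_cpus + threads_per_core - 1) /
			      threads_per_core) * threads_per_core;

	printf("requested=%d allocated=%d\n", requested_cpus, allocated_cpus);
	/* prints: requested=1 allocated=2 */
	return 0;
}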

Bug 6639 had these errors, though it appears the customer didn't file a separate ticket for them.
Comment 4 Alejandro Sanchez 2019-04-11 10:33:22 MDT
With this config:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=12000 NodeHostname=polaris State=UNKNOWN Port=61201-61202 GRES=gpu:tesla:2
PartitionName=p1 Nodes=ALL Default=YES State=UP DefMemPerCPU=800

Running the regression tests on 18.08 HEAD, I got the under-allocated error:

[2019-04-11T15:30:39.666] error: select/cons_res: node compute1 memory is under-allocated (0-1600) for JobId=20186

so I went to the regression log and found that job 20186 was submitted from within:

TEST: 1.42
spawn /home/alex/slurm/18.08/install/bin/sbatch --output=/dev/null --error=/dev/null -t1 test1.42.input1
Submitted batch job 20185
spawn /home/alex/slurm/18.08/install/bin/srun -t1 --dependency=afterany:20185 /home/alex/slurm/18.08/install/bin/scontrol show job 20185
srun: job 20186 queued and waiting for resources
srun: job 20186 has been allocated resources
JobId=20185 JobName=test1.42.input1
   UserId=alex(1000) GroupId=docker(130) MCS_label=N/A
   Priority=50000 Nice=0 Account=acct1 QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:11 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-04-11T15:30:27 EligibleTime=2019-04-11T15:30:27
   AccrueTime=2019-04-11T15:30:27
   StartTime=2019-04-11T15:30:27 EndTime=2019-04-11T15:30:38 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-11T15:30:27
   Partition=p1 AllocNode:Sid=polaris:10058
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute1
   BatchHost=compute1
   NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1600M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/alex/slurm/source/testsuite/expect/test1.42.input1
   WorkDir=/home/alex/slurm/source/testsuite/expect
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null
   Power=


SUCCESS

Hopefully that helps with reproducing it.
Comment 12 Alejandro Sanchez 2019-05-30 05:25:37 MDT
Marshall, I already worked on the related "memory is over-allocated" bug, and I'm also working on another bug in the will_run_test area. I'm annoyed enough by this error at this point... do you mind if I steal this ticket from you? :)

After adding some debug logs:

slurmctld: sched: _slurm_rpc_allocate_resources JobId=22882 NodeList=compute1 usec=2108
slurmctld: _job_complete: JobId=22882 WEXITSTATUS 0
slurmctld: select_p_job_fini: calling rm_job_res for JobId=22882
slurmctld: select/cons_tres: rm_job_res: node compute1 removing memory (1600-1600) for JobId=22882
slurmctld: _job_complete: JobId=22882 done
slurmctld: will_run_test: future_usage = _dup_node_usage(select_node_usage);
slurmctld: will_run_test: future_usage[0].alloc_memory=0
slurmctld: will_run_test: p2 calling rm_job_res to remove JobId=22882 to see if JobId=22883 will run when the former ends
slurmctld: error: select/cons_tres: rm_job_res: node compute1 memory is under-allocated (0-1600) for JobId=22882

It looks like the problem is that we double-deallocate resources for completing jobs that are removed in will_run_test when emulating a future scenario. The first deallocation happens when the first job finishes:

slurmctld: select_p_job_fini: calling rm_job_res for JobId=22882
slurmctld: select/cons_tres: rm_job_res: node compute1 removing memory (1600-1600) for JobId=22882

At this point we have already removed the job's resources from the node usage. The next job then triggers will_run_test, which builds a list of candidate jobs and iteratively removes their resources to predict when and where the new job could start. Since completing jobs are included in that candidate list, we call rm_job_res again for jobs that have already deallocated their resources, and that triggers the error.
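To make the sequence concrete, here is a minimal self-contained C sketch of the failure mode. This is not the actual plugin code: node_usage, job and the rm_job_res below are simplified stand-ins for the cons_res/cons_tres bookkeeping, and the guard at the end only illustrates the kind of check that is missing.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for the per-node usage bookkeeping. */
struct node_usage {
	const char *name;
	uint64_t alloc_memory;	/* memory currently accounted as allocated */
};

struct job {
	int job_id;
	uint64_t mem;		/* memory this job holds on the node */
	bool resources_freed;	/* set once rm_job_res() has run for it */
};

/* Roughly what rm_job_res() does to the memory counter of one node. */
static void rm_job_res(struct node_usage *node, struct job *job)
{
	if (node->alloc_memory >= job->mem) {
		printf("removing memory (%lu-%lu) for JobId=%d\n",
		       (unsigned long) node->alloc_memory,
		       (unsigned long) job->mem, job->job_id);
		node->alloc_memory -= job->mem;
	} else {
		printf("error: node %s memory is under-allocated (%lu-%lu) for JobId=%d\n",
		       node->name, (unsigned long) node->alloc_memory,
		       (unsigned long) job->mem, job->job_id);
		node->alloc_memory = 0;
	}
	job->resources_freed = true;
}

int main(void)
{
	struct node_usage compute1 = { "compute1", 1600 };
	struct job job = { 22882, 1600, false };

	/* 1. The job finishes: select_p_job_fini() removes its resources
	 *    from the live node usage. */
	rm_job_res(&compute1, &job);

	/* 2. will_run_test() emulates the future on a copy of the usage
	 *    and removes the same (still completing) job again. Without
	 *    a guard this second call underflows and logs the error. */
	struct node_usage future_usage = compute1;	/* duplicated usage */
	if (!job.resources_freed)	/* one possible guard */
		rm_job_res(&future_usage, &job);
	else
		printf("skipping JobId=%d: resources already removed\n",
		       job.job_id);
	return 0;
}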

Working on a fix now.
Comment 13 Alejandro Sanchez 2019-05-30 05:34:46 MDT
I want to note that the second rm_job_res is performed on duped structs:

future_part = _dup_part_data(select_part_record);
future_usage = _dup_node_usage(select_node_usage);

and thus the "double deallocation" doesn't actually happen on the original resource counters but on the duplicated ones. That makes the error less concerning than it could be, but we still need to fix it, since it can skew the start-time predictions and leave the future_* structs in an inconsistent state.
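In other words, will_run_test works on throwaway copies of the allocation state. A rough sketch of that pattern, using hypothetical names rather than the real _dup_node_usage() internals:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified per-node usage record. */
struct node_use_record {
	uint64_t alloc_memory;
};

/*
 * Deep-copy the per-node usage table so the "what if these jobs had
 * already finished" emulation can subtract resources freely without
 * touching the live counters used for real scheduling decisions.
 */
struct node_use_record *dup_node_usage(const struct node_use_record *orig,
				       int node_cnt)
{
	struct node_use_record *copy = malloc(node_cnt * sizeof(*copy));

	if (copy)
		memcpy(copy, orig, node_cnt * sizeof(*copy));
	return copy;	/* the caller frees the copy after the emulation */
}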
Comment 20 Marshall Garey 2019-06-11 10:51:54 MDT
*** Ticket 7221 has been marked as a duplicate of this ticket. ***
Comment 21 Marshall Garey 2019-06-11 10:54:25 MDT
(In reply to Regine Gaudin from comment #0)
> Hello
> As suggested in bug 6879, I'm opening this bug for annoying messages in
> slurmctld.log filling it too fast:
> "Hi
> 
> I'm updating this bug as CEA is also encountering memory under-allocated
> errors you have mentioned (bug 6769), filling slurmctld.log:
> error: select/cons_res: node machine1234 memory is under-allocated
> (0-188800) for JobID=XXXXXX
> the same one, repeated over and over
> 
> As you wrote "there are proposed fixes for both issues I mentioned
> (accrue_cnt underflow and memory under-allocated errors)", I'd like to let
> you know that CEA would also be interested in the proposed fixes. The Slurm
> controller is on 18.08.06 and the clients are on 17.11.6 but will soon be
> upgraded to 18.08.06.
> 
> Thanks
> 
> Regine"
> 
> Comment 11 Marshall Garey 2019-06-10 10:22:12 MDT
> 
> Regine - the patches for both bugs are pending internal QA/review. They'll
> both definitely be in 19.05, and probably will both be in 18.08. Although I
> hope they'll both be in the next tag, I can't promise that. If you'd like
> patches provided before they're in the public repo, can you create a new
> ticket for that?
> 
> 
> Thanks for providing patches for 18
> 
> Regine

I'll let Alex or others comment further. However, I have to retract my statement that they'll both "probably" be in 18.08. I found out that the fix for this one is being written for 19.05 and not 18.08, so further discussion will be needed to determine whether it will be backported to 18.08.
Comment 41 Alejandro Sanchez 2019-08-15 06:04:15 MDT
Hi,

This has been fixed by the following commits in 19.05:

2dd1f448ca
0666db61ca
61269349c3

The select/cons_res and select/cons_tres plugins share a lot of the same logic. In the master branch (the future 20.02 tag) that logic has been refactored into a common select/cons_common layer, which is why merging the previous commits up to master required some extra work. While working on this bug we also noticed some unneeded Cray NHC logic around the same fix area, and that has been removed as well. These are the 20.02 commits reflecting all of these changes:

d4913ae9a1a3
889615a6f4f8
fdb9474e9aa0
6b4d41d037ac

I'm closing this bug. Please re-open it if there's anything else. Thanks.
Comment 42 Jason Booth 2019-10-04 09:58:32 MDT
*** Ticket 7866 has been marked as a duplicate of this ticket. ***