Ticket 3415

Summary: Nodes dropping to "draining" with Low Real Memory error
Product: Slurm Reporter: Ciaron Linstead <linstead>
Component: slurmd    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: ---    
Version: 16.05.8   
Hardware: Linux   
OS: Linux   
Site: PIK
Attachments: slurm.conf
slurmd excerpt for an affected node

Description Ciaron Linstead 2017-01-23 09:48:35 MST
Created attachment 3958 [details]
slurm.conf

Hi

We've just upgraded from 15.08.4 to 16.05.8.

Our problem is that many nodes are now dropping to "Draining" with the reason "Low Real Memory" (some even with no user applications running, having just been booted, though others have been up for more than a day).

We have 64GB RAM per node (RealMemory=65536), initially set 3584MB DefMemPerCPU, currently down to 3000 to see if this helps.

Free memory on the node never seems to /really/ disappear, unless something is spiking that we don't see.

Excerpt from slurmctld.log for a typical node:

[2017-01-23T17:34:10.662] error: Node cs-f14c02b08 has low real_memory size (64120 < 65536)
[2017-01-23T17:34:10.662] error: Setting node cs-f14c02b08 state to DRAIN
[2017-01-23T17:34:10.662] drain_nodes: node cs-f14c02b08 state set to DRAIN

scontrol show node for this node:

NodeName=cs-f14c02b08 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cs-f14c02b08 NodeHostName=cs-f14c02b08 Version=16.05
   OS=Linux RealMemory=65536 AllocMem=0 FreeMem=59662 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2017-01-23T17:29:01 SlurmdStartTime=2017-01-23T17:30:59
   CapWatts=n/a
   CurrentWatts=40 LowestJoules=400 ConsumedJoules=9460
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2017-01-23T17:34:10]

Attached: excerpt from slurmd log for this node at the time of the draining, and slurm.conf.

Best regards

Ciaron
Comment 1 Ciaron Linstead 2017-01-23 09:49:03 MST
Created attachment 3959 [details]
slurmd excerpt for an affected node
Comment 2 Tim Wickberg 2017-01-23 11:54:48 MST
Did you change any configuration at the time of the upgrade? 15.08.4 should have had the same behavior here if your config hasn't changed.

slurmd will automatically drain the node if the amount of memory reported by the OS is less than what is configured. This is designed to ensure the node is healthy, and that the server hasn't lost access to some of the CPUs or part of the memory, as you wouldn't want jobs to run on failing hardware.
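As a rough illustration (this is not slurmd's actual code path, just a sketch), the figure slurmd compares against the configured RealMemory comes from the OS's view of total memory, which on Linux is comparable to MemTotal in /proc/meminfo:

```shell
# Sketch: read the OS-reported memory in MB, as slurmd would see it.
# slurmd uses its own probe, but the value is comparable to MemTotal.
mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
echo "OS-reported memory: ${mem_mb} MB"
# If this number is below the node's configured RealMemory,
# slurmd drains the node with "Low RealMemory".
```

On the affected node above, this check yields 64120 MB against a configured 65536 MB, hence the drain.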

In this case, slurmd is indicating that the OS reports 64120 MB, where you'd specified that the node should have at least 65536 MB available. If you change the configuration lines to RealMemory=64000 you'd be safe. (You can always under-specify memory, and we usually recommend that as part of a new configuration; it ensures that at least some memory is not allocated out to jobs, leaving some room for the OS itself.)
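A minimal sketch of what that change looks like in slurm.conf (the node name here matches the affected node from the log; the other parameters are illustrative and only RealMemory changes):

```
# slurm.conf excerpt - under-specify RealMemory to leave headroom for the OS
NodeName=cs-f14c02b08 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64000
```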

- Tim
Comment 3 Ciaron Linstead 2017-01-23 11:59:48 MST
Hi Tim

Thanks for the fast reply!

We didn't change anything in the config, no.

I'll try out your RealMemory=64000 suggestions and see what happens overnight.

Cheers

Ciaron
Comment 4 Tim Wickberg 2017-01-23 12:13:21 MST
(In reply to Ciaron Linstead from comment #3)
> Hi Tim
> 
> Thanks for the fast reply!
> 
> We didn't change anything in the config, no.
> 
> I'll try out your RealMemory=64000 suggestions and see what happens
> overnight.

One note - you'll need to run 'scontrol reconfigure' after making the change for it to take effect; you can do that at any time.
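A minimal sequence, run on the controller (node name illustrative; a drained node may also need an explicit resume once the configured RealMemory is correct):

```
scontrol reconfigure                                 # re-read slurm.conf on all daemons
scontrol update NodeName=cs-f14c02b08 State=RESUME   # clear the DRAIN state on the node
```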
Comment 5 Tim Wickberg 2017-01-23 12:38:01 MST
I'm looking at our internal customer config repository, and it does look like a change was made to slurm.conf in the past 8 months that caused this. In the last config we had from June (bug 2762), you had RealMemory=64440, but it has since been increased to RealMemory=65536, which causes this issue.

Changing it to 64000 (or 64440) should solve this for you. Just run 'scontrol reconfigure' after making the adjustment and it should no longer drain nodes automatically.

I'm going to go ahead and mark this as resolved/infogiven as I can conclusively state that's the cause; please re-open if you have further questions.

- Tim
Comment 6 Ciaron Linstead 2017-01-23 12:44:12 MST
Hi Tim

That change (64400 -> 65536) was made today while trying to diagnose the problem (I just went in the wrong direction!); 64400 worked fine with 15.08.4 but not with 16.05.8, and we also moved from SLES11 to SLES12 SP1. Currently, at 64000, everything seems to be stable once more.

Thanks again!

Ciaron