| Summary: | slurmdbd exhausting memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kolbeinn Josepsson <kolbeinn.josepsson> |
| Component: | slurmdbd | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | deCODE | | |
| Attachments: | slurm.conf, slurmdbd.log, slurmctld.log | | |
Hi Kolbeinn,

The newest version of Slurm, 17.11, adds a slurmdbd.conf option called MaxQueryTimeRange which limits the time span a single database query may cover. This can be used to prevent users (other than root and SlurmUser) from making queries large enough to crash your slurmdbd. Alternatively, if you don't wish to upgrade at this time, you could use the job_submit plugin to do the same thing.

(In reply to Isaac Hartung from comment #3)
> Or, if you don't wish to upgrade at this time, you could use the job_submit
> plugin to do the same thing.

Correction: database queries do not pass through the job_submit plugin, so this is not an option. The upgrade would therefore be necessary to guard against excessively large user database queries.

Hi,

Thanks for the information. If I understand you correctly, this is working as designed. We will consider upgrading sooner or later (though we might wait a few weeks, as we upgraded to 17.02.7 only a few months ago). You are free to close this ticket.

Best regards,
Kolbeinn

Ok, thanks.
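For reference, a minimal slurmdbd.conf sketch of the MaxQueryTimeRange setting described above. The 30-day value is only an illustrative assumption, not a recommendation:

```
# slurmdbd.conf fragment (Slurm 17.11 or later)
# Limit the time range a single user database query may cover to 30 days.
# Queries from root and SlurmUser are exempt from this limit.
MaxQueryTimeRange=30-00:00:00
```

The value uses Slurm's days-hours:minutes:seconds time format; tune it to the longest range your users legitimately need.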
Created attachment 5685 [details]
slurm.conf slurmdbd.log slurmctld.log

Hi,

We have occasionally had issues where slurmdbd exhausts the head node's memory. We run slurmctld and slurmdbd/MySQL on the same machine, a physical server with 16 CPU cores and 128 GB of memory.

Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Out of memory: Kill process 11968 (slurmdbd) score 873 or sacrifice child
Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Killed process 11968 (slurmdbd) total-vm:117342548kB, anon-rss:116149264kB, fil
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service: main process exited, code=killed, status=9/KILL
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: Unit slurmdbd.service entered failed state.
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service failed.

In this most recent case, one of our power users had run an sacct query earlier the same day, listing all jobs for one month, but the query failed:

sacct --brief -a -S 2017-11-01 -E 2017-11-30

He then ran two queries covering only one day each; those completed without errors. We are not aware of any query running during the night when slurmdbd exhausted the memory. A few months ago we had a similar issue, but at that time a colleague of mine was running a large query while the memory was exhausted.

Our question is whether this could possibly be a bug, or whether there is some way to work around it so that users cannot crash slurmdbd?

Best regards,
Kolbeinn
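Until an upgrade is possible, the per-day workaround described above (splitting a month-long query into one-day queries) can be scripted. A sketch, assuming GNU date is available; it only prints the sacct commands it would run (a dry run), so nothing is queried:

```shell
#!/bin/sh
# Split a month-long sacct query into per-day queries so that each
# slurmdbd response stays small. Dry run: echoes the commands instead
# of executing them; drop the `echo` to run them for real.
start=2017-11-01
end=2017-11-30
day="$start"
count=0
while [ "$(date -d "$day" +%s)" -le "$(date -d "$end" +%s)" ]; do
    next=$(date -d "$day + 1 day" +%F)
    echo sacct --brief -a -S "$day" -E "$next"
    day="$next"
    count=$((count + 1))
done
```

Each iteration covers exactly one day, mirroring the one-day queries that completed without errors in the report above.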