Ticket 4482 - slurmdbd exhausting memory
Summary: slurmdbd exhausting memory
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd (show other tickets)
Version: 17.02.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-12-07 05:16 MST by Kolbeinn Josepsson
Modified: 2017-12-18 11:46 MST (History)
0 users

See Also:
Site: deCODE
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf slurmdbd.log slurmctld.log (1.42 MB, application/x-zip-compressed)
2017-12-07 05:16 MST, Kolbeinn Josepsson
Details

Description Kolbeinn Josepsson 2017-12-07 05:16:35 MST
Created attachment 5685 [details]
slurm.conf slurmdbd.log slurmctld.log

Hi, we have occasionally had issues where slurmdbd exhausts the head node's memory. We run slurmctld and slurmdbd/mysql on the same machine, a physical server with 16 CPU cores and 128 GB of memory.
 
Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Out of memory: Kill process 11968 (slurmdbd) score 873 or sacrifice child
Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Killed process 11968 (slurmdbd) total-vm:117342548kB, anon-rss:116149264kB, fil
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service: main process exited, code=killed, status=9/KILL
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: Unit slurmdbd.service entered failed state.
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service failed.

In this recent case, one of our power users had run an sacct query earlier the same day, listing all jobs for one month, but the query failed:
sacct --brief -a -S 2017-11-01 -E 2017-11-30

Following this, he ran two queries listing only one day each; those queries completed without errors.

We are not aware of any query running during the night when slurmdbd exhausted the memory.

A few months ago we had a similar issue, but at that time a colleague was running a large query when the memory was exhausted.

Our question is whether this could possibly be a bug, or whether there is some way to work around it so that users cannot crash slurmdbd.

Best regards,
Kolbeinn
Comment 2 Isaac Hartung 2017-12-08 14:08:08 MST
Hi Kolbeinn,

In the newest version of Slurm, 17.11, there is a configuration option (in slurmdbd.conf) called MaxQueryTimeRange which limits the time span a single query can request from the database. This can be used to prevent users (other than root and SlurmUser) from making queries large enough to crash your slurmdbd.
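For reference, a minimal slurmdbd.conf fragment using this option might look like the following sketch. The 30-day limit is an illustrative value, not taken from this site's configuration, and the option requires Slurm 17.11 or later:

```ini
# slurmdbd.conf (excerpt) -- illustrative values only
#
# Reject accounting queries spanning more than 30 days
# (format: days-hours:minutes:seconds). Queries from root
# and SlurmUser are exempt from this limit.
MaxQueryTimeRange=30-00:00:00
```

With this in place, a user running something like `sacct -a -S 2017-01-01 -E 2017-12-31` would receive an error instead of forcing slurmdbd to load a year of job records into memory.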
Comment 3 Isaac Hartung 2017-12-08 14:09:27 MST
Or, if you don't wish to upgrade at this time, you could use the job_submit plugin to do the same thing.
Comment 4 Isaac Hartung 2017-12-08 14:25:00 MST
(In reply to Isaac Hartung from comment #3)
> Or, if you don't wish to upgrade at this time, you could use the job_submit
> plugin to do the same thing.

Correction: database queries do not pass through the job_submit plugin, so this is not an option. The upgrade would therefore be necessary to vet excessively large user database queries.
Comment 5 Kolbeinn Josepsson 2017-12-18 07:39:22 MST
Hi, thanks for the information. If I understand you correctly, this is working as designed.

We will consider upgrading sooner or later (but we might wait a few weeks, as we upgraded to 17.02.7 only a few months ago).

You are free to close this ticket.

Best regards,
Kolbeinn
Comment 6 Isaac Hartung 2017-12-18 11:46:10 MST
Ok, thanks.