| Summary: | slurmdbd exhausting memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kolbeinn Josepsson <kolbeinn.josepsson> |
| Component: | slurmdbd | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | deCODE | | |
| Attachments: | slurm.conf, slurmdbd.log, slurmctld.log | | |
Hi Kolbeinn,

The newest version of Slurm, 17.11, adds a slurmdbd.conf option called MaxQueryTimeRange which limits the time span a single database query may cover. This can be used to prevent users (other than root and SlurmUser) from making queries large enough to crash your slurmdbd. Alternatively, if you don't wish to upgrade at this time, you could use the job_submit plugin to do the same thing.

(In reply to Isaac Hartung from comment #3)
> Or, if you don't wish to upgrade at this time, you could use the job_submit
> plugin to do the same thing.

Correction: database queries do not pass through the job_submit plugin, so this is not an option. The upgrade would therefore be necessary to guard against excessively large user database queries.

Hi,

Thanks for the information. If I understand you correctly, this is working as designed. We will consider upgrading sooner or later (though we might wait a few weeks, as we upgraded to 17.02.7 only a few months ago). You are free to close this ticket.

Best regards,
Kolbeinn

Ok, thanks.
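For reference, a minimal slurmdbd.conf sketch of the MaxQueryTimeRange setting described above. The 30-day value is only an illustrative assumption, not a recommendation:

```
# slurmdbd.conf fragment (Slurm 17.11 or later)
# Limit the time range a single user database query may cover to 30 days.
# Queries from root and SlurmUser are exempt from this limit.
MaxQueryTimeRange=30-00:00:00
```

The value uses Slurm's days-hours:minutes:seconds time format; tune it to the longest range your users legitimately need.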
Created attachment 5685 [details]
slurm.conf slurmdbd.log slurmctld.log

Hi,

We have occasionally had issues where slurmdbd exhausts the head node's memory. We run slurmctld and slurmdbd/MySQL on the same machine, a physical server with 16 CPU cores and 128 GB of memory.

Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Out of memory: Kill process 11968 (slurmdbd) score 873 or sacrifice child
Dec 04 19:45:42 ru-lhpc-head.decode.is kernel: Killed process 11968 (slurmdbd) total-vm:117342548kB, anon-rss:116149264kB, fil
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service: main process exited, code=killed, status=9/KILL
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: Unit slurmdbd.service entered failed state.
Dec 04 19:45:48 ru-lhpc-head.decode.is systemd[1]: slurmdbd.service failed.

In this most recent case, one of our power users had run an sacct query earlier the same day, listing all jobs for one month, but the query failed:

sacct --brief -a -S 2017-11-01 -E 2017-11-30

He then ran two queries covering only one day each; those completed without errors. We are not aware of any query running during the night when slurmdbd exhausted the memory. A few months ago we had a similar issue, but at that time a colleague of mine was running a large query while the memory was exhausted.

Our question is whether this could possibly be a bug, or whether there is some way to work around it so that users cannot crash slurmdbd?

Best regards,
Kolbeinn
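Until an upgrade is possible, the per-day workaround described above (splitting a month-long query into one-day queries) can be scripted. A sketch, assuming GNU date is available; it only prints the sacct commands it would run (a dry run), so nothing is queried:

```shell
#!/bin/sh
# Split a month-long sacct query into per-day queries so that each
# slurmdbd response stays small. Dry run: echoes the commands instead
# of executing them; drop the `echo` to run them for real.
start=2017-11-01
end=2017-11-30
day="$start"
count=0
while [ "$(date -d "$day" +%s)" -le "$(date -d "$end" +%s)" ]; do
    next=$(date -d "$day + 1 day" +%F)
    echo sacct --brief -a -S "$day" -E "$next"
    day="$next"
    count=$((count + 1))
done
```

Each iteration covers exactly one day, mirroring the one-day queries that completed without errors in the report above.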