Ticket 11149 - How to monitor slurm jobs that are blocking the slurm queue
Summary: How to monitor slurm jobs that are blocking the slurm queue
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-19 04:32 MDT by Hjalti Sveinsson
Modified: 2021-04-13 05:58 MDT

See Also:
Site: deCODE
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Hjalti Sveinsson 2021-03-19 04:32:49 MDT
Hi, 

we are seeing issues where one user requests a certain number of cores and a certain amount of memory, and this seems to block all other jobs from starting.

Today we had a user requesting 24 cores and 350 GB of memory per job. I saw that there were 339 jobs pending for him, while only 13000 / 37000 cores were in use on the cluster and a lot of memory was available across the nodes.

This user canceled these 339 jobs and the floodgates opened, i.e. a lot of other jobs were able to start on the nodes.

Two questions related to this:

1. How can we monitor this activity? Is there any command or script you can share with us that shows when this behavior starts, so we can respond promptly or trigger some automation that resolves the issue?
2. Is there a way for us to change the backfill options so this does not happen, i.e. so that small jobs waiting in the queue still get started?

regards,
Hjalti
Comment 2 Carlos Tripiana Montes 2021-03-19 11:18:56 MDT
Hi Hjalti,

For checking the cluster status in terms of jobs and nodes, squeue and sinfo are the best options. You can automate a procedure that checks the output of these commands, along the lines of the sketch below.
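
As a starting point, here is a minimal monitoring sketch. The thresholds and the exact set of pending reasons are only assumptions and would need adapting to your cluster; it is not a supported tool.

#!/bin/bash
# Rough monitoring sketch -- thresholds and pending reasons are assumptions.
# Idea: warn when many jobs are pending on "Resources"/"Priority" while a
# large share of the cluster's CPUs is still idle.

PENDING_THRESHOLD=200   # alert if more pending jobs than this...
IDLE_PCT_THRESHOLD=40   # ...while more than this % of CPUs is idle

# Aggregate CPU counts (Allocated/Idle/Other/Total). Note: nodes that belong
# to several partitions may be counted more than once by this simple sum.
read -r alloc idle other total < <(
    sinfo -h -o "%C" | awk -F'/' '{a+=$1; i+=$2; o+=$3; t+=$4} END {print a, i, o, t}'
)
[ "${total:-0}" -gt 0 ] || exit 0

# Jobs pending because of resources or priority (not held, not dependencies)
pending=$(squeue -h -t PENDING -o "%r" | grep -cE 'Resources|Priority')

idle_pct=$(( 100 * idle / total ))

if [ "$pending" -gt "$PENDING_THRESHOLD" ] && [ "$idle_pct" -gt "$IDLE_PCT_THRESHOLD" ]; then
    echo "WARNING: $pending jobs pending while ${idle_pct}% of CPUs are idle"
    # Top users by number of pending jobs, to spot who may be holding the queue
    squeue -h -t PENDING -o "%u" | sort | uniq -c | sort -rn | head -5
fi

You could run this from cron every few minutes and pipe the warning to whatever alerting you already use.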

Regarding the 2nd question, a copy of your slurm.conf would be much appreciated. Also, from our records, we already have some conf files provided by deCODE for "hpc-sequor", "lhpc", and "ru-hpc-test". Is this issue related to any of these?
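
In the meantime, and only as a rough sketch until we can review your actual configuration, these are SchedulerParameters that are commonly tuned when one user's large pending jobs starve backfill. The values below are placeholders, not recommendations for deCODE:

# Hypothetical slurm.conf excerpt -- placeholder values, to be adjusted after review
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=600,bf_max_job_test=1000,bf_max_job_user=50,default_queue_depth=500

In particular, bf_max_job_user caps how many pending jobs from a single user the backfill scheduler will consider per cycle, so one user with hundreds of queued large-memory jobs cannot monopolize the backfill pass.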

Thanks.
Comment 3 Carlos Tripiana Montes 2021-03-26 03:16:51 MDT
Hi Hjalti,

Whenever you have time, please take a look at my previous answer and tell me if it is what you are looking for. Also, please provide the info I've requested, if possible.

Thanks.
Comment 4 Carlos Tripiana Montes 2021-04-13 05:58:08 MDT
Going to close the issue as timed out. Please feel free to reopen it if necessary. Thanks.