Ticket 11149

Summary: How to monitor slurm jobs that are blocking the slurm queue
Product: Slurm
Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Component: Scheduling
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED TIMEDOUT
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: deCODE

Description Hjalti Sveinsson 2021-03-19 04:32:49 MDT
Hi, 

We are seeing an issue where one user requests a certain number of cores and a certain amount of memory, and this seems to block all other jobs from starting.

Today a user requested 24 cores and 350 GB of memory per job. I saw that he had 339 jobs pending, yet only 13000 of the 37000 cores on the cluster were in use and a lot of memory was available across the nodes.

The user canceled these 339 jobs and the floodgates opened, i.e. a lot of jobs were then able to start on the nodes.

Two questions related to this:

1. How can we monitor this activity? Is there a command or script you can share with us that shows when this behavior starts, so that we can respond promptly or have some sort of automation resolve the issue?
2. Is there a way for us to change the backfill options so that this does not happen? That is, so that jobs waiting in the queue, i.e. small jobs, still get started.

regards,
Hjalti
Comment 2 Carlos Tripiana Montes 2021-03-19 11:18:56 MDT
Hi Hjalti,

For checking the cluster status in terms of jobs and nodes, squeue and sinfo are the best options. You can automate a procedure that periodically checks the output of these commands.
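As a minimal sketch of that kind of automation (not an official SchedMD tool), the Python below counts pending jobs per user and reason from squeue and compares that with the idle CPU count from sinfo. The thresholds and the function names are hypothetical. It also assumes that `sinfo --noheader --format=%C` prints plain integers as allocated/idle/other/total, and that the first output line represents the whole cluster.

```python
import subprocess
from collections import Counter

# Hypothetical thresholds; tune them for your own cluster.
IDLE_FRACTION_ALERT = 0.5   # alert if at least half of the CPUs are idle
PENDING_JOBS_ALERT = 100    # ...while at least this many jobs are pending


def parse_pending(squeue_text):
    """Count pending jobs per (user, reason) from 'user|reason' lines."""
    counts = Counter()
    for line in squeue_text.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        user, reason = line.strip().split("|", 1)
        counts[(user, reason)] += 1
    return counts


def parse_cpu_states(sinfo_text):
    """Parse the sinfo %C field, 'allocated/idle/other/total', into ints."""
    first = sinfo_text.strip().splitlines()[0]
    allocated, idle, other, total = (int(n) for n in first.split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def looks_blocked(pending, cpus):
    """Heuristic only: many pending jobs while much of the cluster is idle."""
    idle_fraction = cpus["idle"] / cpus["total"] if cpus["total"] else 0.0
    return (idle_fraction >= IDLE_FRACTION_ALERT
            and sum(pending.values()) >= PENDING_JOBS_ALERT)


def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def main():
    # %u = user name, %r = reason the job is pending
    pending = parse_pending(
        run(["squeue", "--states=PENDING", "--noheader", "--format=%u|%r"]))
    cpus = parse_cpu_states(run(["sinfo", "--noheader", "--format=%C"]))
    if looks_blocked(pending, cpus):
        print("Possible queue blockage: idle CPUs =", cpus["idle"])
        # Show the largest (user, reason) groups first.
        for (user, reason), count in pending.most_common(10):
            print(f"{count:6d}  {user:<12s} {reason}")
```

Calling `main()` from cron or a monitoring agent would print the largest groups of pending jobs whenever the heuristic fires. A user with many jobs pending with reason `Resources` while other users' jobs wait with reason `Priority` is a candidate for the situation described above, though the heuristic alone does not prove it.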

Regarding the second question, a copy of your slurm.conf would be much appreciated. Also, according to our records, deCODE has already provided configuration files for "hpc-sequor", "lhpc", and "ru-hpc-test". Is this issue related to any of these clusters?

Thanks.
Comment 3 Carlos Tripiana Montes 2021-03-26 03:16:51 MDT
Hi Hjalti,

Whenever you have time, please take a look at my previous answer and tell me whether this is what you are looking for. Also, please provide the information I requested, if possible.

Thanks.
Comment 4 Carlos Tripiana Montes 2021-04-13 05:58:08 MDT
I am going to close the issue as timed out. Please feel free to reopen it if necessary. Thanks.