Hi, we are seeing an issue where one user requests a certain number of cores and a certain amount of memory, and this seems to block all other jobs from starting. Today we had a user requesting 24 cores and 350GB of memory per job. I saw that there were 339 jobs pending for him, while only 13000 / 37000 cores were in use on the cluster and a lot of memory was available across the nodes. When this user canceled those 339 jobs, the floodgates opened, i.e. a lot of jobs were able to start on nodes. I have 2 questions related to this. 1. How can we monitor this activity? Is there a command or script you can share that shows when this behavior starts, so we can respond promptly or set up some automation that resolves the issue? 2. Is there a way for us to change the backfill options so this does not happen, i.e. so that small jobs waiting in the queue still get started? regards, Hjalti
Hi Hjalti, For checking the cluster status in terms of jobs and nodes, squeue and sinfo are the best options. You can automate a procedure that checks the output of these commands. Regarding the second question, a copy of your slurm.conf would be much appreciated. Also, from our records we already have some conf files provided by deCODE for "hpc-sequor", "lhpc", and "ru-hpc-test". Is this issue related to any of these? Thanks.
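To make the monitoring suggestion above concrete, here is a minimal sketch of how squeue and sinfo output could be checked automatically. The function names, thresholds, and output format are assumptions for illustration, not an official deCODE or SchedMD tool; `sinfo -o '%C'` prints CPU counts as allocated/idle/other/total, and `squeue -t PENDING -o '%u'` lists the owner of each pending job.

```shell
#!/bin/sh
# Hypothetical sketch: flag any single user with many pending jobs while a
# large share of cluster cores is still idle (the symptom described above).
# Thresholds and names are illustrative assumptions.

# Sum the "idle" field from sinfo's allocated/idle/other/total CPU output.
sum_idle_cores() {
    # expects lines like "12000/15000/0/27000" on stdin
    awk -F/ '{sum += $2} END {print sum + 0}'
}

check_blocking() {
    pending_threshold=${1:-100}     # pending jobs per user considered suspicious
    idle_core_threshold=${2:-5000}  # idle cores above which the cluster is "not full"

    idle_cores=$(sinfo -h -o '%C' | sum_idle_cores)
    [ "$idle_cores" -lt "$idle_core_threshold" ] && return 0

    # Count pending jobs per user and flag heavy queuers.
    squeue -h -t PENDING -o '%u' | sort | uniq -c |
    while read count user; do
        if [ "$count" -ge "$pending_threshold" ]; then
            echo "WARNING: $user has $count pending jobs while $idle_cores cores are idle"
            # Show why this user's jobs are pending (e.g. Resources, Priority).
            squeue -h -t PENDING -u "$user" -o '%i %r' | head -5
        fi
    done
}

# Run periodically from cron, e.g.: check_blocking 100 5000
```

For the second question, the usual knobs are the `SchedulerParameters` options in slurm.conf, for example `bf_continue`, `bf_window`, `bf_max_job_user`, and `default_queue_depth`; which values are appropriate depends on the cluster, which is why a copy of the current slurm.conf is needed before recommending specific changes.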
Hi Hjalti, Whenever you have time, please take a look at my previous answer and tell me whether it is what you are looking for. Also, please provide the information I requested, if possible. Thanks.
I am going to close this issue as timed out. Please feel free to reopen it if necessary. Thanks.