Ticket 13617 - NRW20200730: Consulting w/ Sr. Engineer
Summary: NRW20200730: Consulting w/ Sr. Engineer
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other (show other tickets)
Version: 20.02.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-03-15 08:14 MDT by LiDO Team
Modified: 2022-05-20 10:08 MDT (History)
1 user (show)

See Also:
Site: NRW
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: CentOS
Machine Name: LiDO3
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (20.69 KB, text/plain)
2022-03-17 09:32 MDT, LiDO Team
Details
squeue + sinfo output (1.51 MB, application/x-zip-compressed)
2022-04-07 07:24 MDT, LiDO Team
Details

Note You need to log in before you can comment on or make changes to this ticket.
Description LiDO Team 2022-03-15 08:14:40 MDT
Hi,

(I hope this is the right channel for this request.)

As part of HPC.NRW, we would like to arrange a date for the site-related consulting session for TU Dortmund.

Concerning our own calendar: any workday from 25 April onward would be fine.


Best regards,
Dirk Ribbrock
Comment 2 Nick Ihli 2022-03-15 08:54:46 MDT
Hi Dirk,

What are the main topics/questions you would like to cover during this session?

Thanks,
Nick
Comment 3 LiDO Team 2022-03-17 09:32:27 MDT
Created attachment 23918 [details]
slurm.conf
Comment 4 LiDO Team 2022-03-17 09:33:16 MDT
Hi Nick,

After some discussion, we think that the greatest benefit for us would be gained if you would help us with some issues we have with our SLURM configuration.

We know of at least three issues that have haunted us for a couple of years:
(greatest impact first)

1) We have three separate partitions for short-running, medium-range and long-running jobs. Nodes are partially shared between these partitions: all nodes of partition 'long' are also in 'med' and 'short'. Partition 'med' has a few additional nodes that are also present in partition 'short', and partition 'short' has some additional nodes exclusively.
In slurm.conf syntax, this translates to:
> PartitionName=short            Nodes=cstd01-[001-244]   State=DOWN  DefMemPerCPU=512 Priority=2000  MaxTime=2:00:00  Default=YES
> PartitionName=med              Nodes=cstd01-[009-244]   State=DOWN  DefMemPerCPU=512 Priority=2000  MaxTime=2:00:00  Default=YES
> PartitionName=long             Nodes=cstd01-[025-244]   State=DOWN  DefMemPerCPU=512 Priority=2000  MaxTime=2:00:00  Default=YES
This ensures that we can cater to both code developers and production jobs. But the cost is that nodes in 'short' and 'med' are often idle, because no one is currently developing and everybody runs production jobs.
Isn't there a way, or a QOS rule, to ensure that at any given time a job submitted with a requested run time of less than 2 hours would not have to wait more than 1 hour on average? Without having to dedicate some compute nodes exclusively to a partition?

2) We have been observing the following with slurmctld:
slurmctld reports "error: Node ... not responding, setting DOWN" (and cancels/requeues the jobs running there) despite slurm.conf's MessageTimeout being set to 45s, despite the (shared) compute node itself not being overloaded, and with neither the node reporting any network downtime nor the switches connecting the compute node and the host running slurmctld reporting any network interruptions.
This happens intermittently, without any pattern that we know of, and we have no idea where to start digging.
Is this a known problem? What would be a reasonable value for MessageTimeout in the case of node sharing (at most 20 concurrent Slurm jobs) and no (significant) overload?
Might this be related to the number of Slurm jobs sharing the same compute node?


3) We would like to bring down the values of 'Idle' and 'Reserved' as reported by 'sreport':
> sreport cluster Utilization start=2021-12-01 end=2022-02-28T23:59:59
> --------------------------------------------------------------------------------
> Cluster Utilization 2021-12-01T00:00:00 - 2022-02-28T23:59:59
> Usage reported in CPU Minutes
> --------------------------------------------------------------------------------
>   Cluster   Allocated       Down PLND Dow        Idle    Reserved    Reported 
> --------- ----------- ---------- -------- ----------- ----------- ----------- 
>     lido3   627004443   28144221        0   250637996   283426289  1189212949 

Because for a typical three-month period they amount to almost 50% of the available non-down compute time (Reported minus Down).
We feel that our utilization is rather suboptimal and could be increased, especially as we allow fine-grained, core-wise resource allocations.

I attach the slurm.conf file.
If helpful, I can provide the slurm database dump starting from 2017.

Best regards
Dirk Ribbrock
Comment 6 Nick Ihli 2022-03-17 13:20:28 MDT
Dirk,

Thank you for the details.  I'll have Ben Roberts help you on this.  He'll reach out with any further questions and with his availability during the week of April 25th so we can find a time to meet.

Thanks,
Nick
Comment 7 Ben Roberts 2022-03-17 14:00:27 MDT
Hi Dirk,

I've been looking over the information you sent and I would be curious to see how the jobs look for the last 3 months shown in your sreport output.  How large is the database dump you have for everything since 2017?  Assuming it is large, can I have you extract just the last 3 months of data you were looking at?  You can gather this by using mysqldump to select jobs with a submit time greater than the timestamp for Dec 01.  It would look something like this, substituting your database name for <database_name>:

mysqldump -u slurm -p <database_name> --single-transaction --tables lido3_job_table --where "time_submit>1638316800" > jobs.sql

Thanks,
Ben
Comment 8 LiDO Team 2022-03-18 03:26:47 MDT
Hi Ben,

The complete database dump starting from 2017 is roughly 2 GB in size.

I've uploaded it to our one-click hosting service.
You should be able to download it via
https://tu-dortmund.sciebo.de/s/ZzyFtOu20SUc2So

If this works for you, all is good.
If not, I can create a smaller archive of the last 3 months.

Best regards
Dirk
Comment 9 Ben Roberts 2022-03-18 13:16:00 MDT
Hi Dirk,

Thanks for sending that information.  I was wondering whether the majority of the jobs on your system requested a large number of CPUs which might cause jobs to spend a large amount of time queued and explain why you have so much idle time.  It doesn't look like this is the case though.  The majority of your jobs are for a small number of CPUs.  

MariaDB [slurm_acct_db]> select cpus_req, count(*) from lido3_job_table where time_submit>1638316800 group by cpus_req;
+----------+----------+
| cpus_req | count(*) |
+----------+----------+
|        1 |   417655 |
|        2 |     5415 |
|        3 |     2701 |
|        4 |    16156 |
|        5 |   197019 |
|        6 |      362 |
|        7 |        6 |
|        8 |   130157 |
|        9 |      244 |
|       10 |    26247 |
|       11 |       92 |
|       12 |     1849 |
|       13 |       11 |
|       14 |       82 |
|       15 |     1376 |
|       16 |    10283 |
|       17 |      463 |
|       18 |      349 |
|       19 |       15 |
|       20 |    88234 |
|       21 |      106 |
|       24 |      581 |
|       25 |       22 |
|       28 |      839 |
|       30 |        8 |
|       32 |     1918 |
|       35 |        2 |
|       38 |        7 |
|       39 |       30 |
|       40 |     1652 |
|       42 |        1 |
|       44 |       29 |
|       48 |     3957 |
|       51 |        4 |
|       60 |      490 |
|       64 |      678 |
|       80 |      912 |
|       96 |        7 |
|      100 |       78 |
|      101 |        3 |
|      120 |      720 |
|      128 |       84 |
|      129 |        1 |
|      140 |        4 |
|      144 |        5 |
|      160 |      610 |
|      180 |        4 |
|      192 |       53 |
|      200 |     1368 |
|      240 |       41 |
|      241 |        2 |
|      256 |        5 |
|      300 |        3 |
|      320 |       52 |
|      400 |        6 |
|      432 |        3 |
|      500 |        1 |
|      540 |        1 |
|      558 |        1 |
|      578 |        3 |
|      600 |        2 |
|      612 |        2 |
|      630 |        1 |
|      640 |        2 |
|      700 |        1 |
|     1201 |        4 |
|     1280 |        2 |
|     2000 |        2 |
|     2401 |        3 |
|     2560 |        1 |
+----------+----------+
70 rows in set (7.439 sec)



Do you notice that the cluster is under-utilized during normal operation, or does it look like things are generally busy?  

We can discuss this further when we meet on the week of Apr 25th.  Do you have a day that week that works better for you?

Thanks,
Ben
Comment 10 LiDO Team 2022-03-21 02:53:08 MDT
Hi,

concerning your question

> Do you notice that the cluster is under-utilized during normal operation, or does it look like things are generally busy?  

We do not monitor the utilization systematically.
We do, however, look at our queues a few times a day and are very confident that the queues are always filled with a bunch of jobs.
(Some of our R users start new jobs by script as soon as their old jobs are finished.)
Usually, we even have jobs waiting that allocate only one core, and still some nodes in the 'short' partition are empty.

I hope to give you a date proposal tomorrow; I need to check with my colleagues beforehand.
Which timezone should we adapt to?

Best regards
Dirk
Comment 11 LiDO Team 2022-03-21 03:40:42 MDT
A crucial sentence went missing:

Is there a SLURM command to measure the system utilization in the past?
Comment 12 Ben Roberts 2022-03-23 09:08:19 MDT
Hi Dirk,

I am in the central time zone in the US (UTC−05:00).

The command we have to easily see system utilization for a given time period would be the sreport command.  You can also query job statistics with sacct to get more specific information about which jobs were running at a given time.  It sounds like you're saying that the system usually remains pretty busy, but you do see cases where there are nodes idle and small jobs queued that look like they could use those nodes.  That sounds like something that would be worth investigating further to see why jobs aren't starting on idle nodes.  We can discuss things that might be causing this and what we can do to investigate on our call, but if you would like to look into it sooner than that please let me know.

Thanks,
Ben
Comment 13 LiDO Team 2022-03-24 09:37:50 MDT
Hi Ben,

we are here in UTC+2 (or will be, once DST starts).
That gives us a 7-hour difference.

When do you usually start your day?
I would suggest something like 14:00 or 15:00 in UTC+2.
This would be 7 or 8 a.m. in your time zone.

Discussing the aforementioned topics in our call sounds good to me.

What about the 4th or 5th of May?
(I just learned that we will be at a conference in the week from 25 April to 29 April.)

Best regards
Dirk Ribbrock
Comment 14 Ben Roberts 2022-03-24 13:18:24 MDT
I usually start at 9AM, but I'm perfectly willing to meet with you at 8AM to accommodate the time difference.  Thursday mornings are a little better for me to meet at 8:00.  So if May 5th at 15:00 DST works for you I'll send out the invite.  Would the invite just go to lido-team.itmc@lists.tu-dortmund.de?  If you need anyone else added just let me know.

Thanks,
Ben
Comment 15 LiDO Team 2022-03-25 03:28:45 MDT
Hi Ben,

May 5th at 15:00 DST is perfect for us.
You can send the invite simply to lido-team.itmc@lists.tu-dortmund.de; I will distribute the link to the video call accordingly.
Our team consists of six admins and hopefully all six will join the meeting.

Are there any further preparations necessary?
Do you need an account of some sort, or what is your usual approach for events like this?

Best regards
Dirk Ribbrock
Comment 16 Ben Roberts 2022-03-25 09:17:03 MDT
I've sent the invite, so let me know if it doesn't show up for some reason.

A consulting session is generally a time to go over any situations that you aren't quite sure how to approach, or to clarify how something is supposed to behave, or to discuss unexpected behavior you're seeing.  If something is found to be a bug then that is usually better handled outside of the consulting session where there can be some time spent looking into the code and how the situation should best be handled.  If it makes more sense to look at what's happening on your system in real time then we can have someone on your team share their screen to show the problem.

For the question about sreport, I do think it would be good to gather some additional information in preparation for the call.  I think that if you could put together a simple script that gathered the output of 'sinfo' and 'squeue' every 5 minutes for an hour (starting and ending on the hour) and sent the output to a file, and then ran 'sreport' for the same window of time, that would give some good data to evaluate and discuss on the call.
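The collection loop Ben describes could be sketched roughly as follows (an illustrative script, not part of the ticket; it assumes sinfo/squeue are in PATH and that it is started on the hour, and the output file name is just an example):

```shell
#!/bin/bash
# Take a sinfo/squeue snapshot every 5 minutes for one hour,
# then append the matching sreport window for comparison.
OUT="snapshot-$(date +%Y-%m-%dT%H).log"
START=$(date +%Y-%m-%dT%H:00:00)
for i in $(seq 0 11); do          # 12 samples x 5 minutes = 1 hour
    {
        echo "=== $(date --iso-8601=seconds) ==="
        sinfo
        squeue
    } >>"$OUT"
    sleep 300
done
# The same one-hour window, as seen by the accounting database:
sreport cluster Utilization start="$START" end="$(date +%Y-%m-%dT%H:00:00)" >>"$OUT"
```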

For your issue with the nodes going down, it looks like there is some history there with the changes I see in your slurm.conf.  I was planning on discussing what happened there and seeing if we can identify any patterns or anything that might be causing delayed responses.  I do think that it would be beneficial to look at the slurmd logs for a node that had this happen and comparing them with the slurmctld logs for the same time period.  Increasing the log level for a node that fails would be nice, but it sounds like this is not something that you can predict and it would be hard to know where to enable debug logging without leaving it for a long time and hoping you catch it.  

If you have any additional topics you want to discuss on the call then please let me know.  

Thanks,
Ben
Comment 17 LiDO Team 2022-04-07 07:24:31 MDT
Created attachment 24294 [details]
squeue + sinfo output
Comment 18 LiDO Team 2022-04-07 07:34:49 MDT
Hi Ben,

we got your invitation.

I gathered the information you asked for (squeue and sinfo).
I ran it for several one-hour windows with a sampling frequency of 5 minutes.

The corresponding sreport outputs:

root@slurm: /root>sreport cluster Utilization start=2022-04-06T17:00:00 end=2022-04-06T17:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-06T17:00:00 - 2022-04-06T17:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   448287     6150        0        0    99483   553920


root@slurm: /root>sreport cluster Utilization start=2022-04-06T20:00:00 end=2022-04-06T20:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-06T20:00:00 - 2022-04-06T20:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   432755     8400        0    57620    55146   553920



root@slurm: /root>sreport cluster Utilization start=2022-04-06T23:00:00 end=2022-04-06T23:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-06T23:00:00 - 2022-04-06T23:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   431913     8400        0   100211    13396   553920



root@slurm: /root>sreport cluster Utilization start=2022-04-07T02:00:00 end=2022-04-07T02:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-07T02:00:00 - 2022-04-07T02:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   398637     8400        0   133719    13164   553920


root@slurm: /root>sreport cluster Utilization start=2022-04-07T05:00:00 end=2022-04-07T05:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-07T05:00:00 - 2022-04-07T05:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   381424     8932        0   138978    24586   553920


root@slurm: /root>sreport cluster Utilization start=2022-04-07T08:00:00 end=2022-04-07T08:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-07T08:00:00 - 2022-04-07T08:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   353994     9600        0   179439    10887   553920



root@slurm: /root>sreport cluster Utilization start=2022-04-07T11:00:00 end=2022-04-07T11:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-07T11:00:00 - 2022-04-07T11:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   377831     5847        0   136397    33845   553920


root@slurm: /root>sreport cluster Utilization start=2022-04-07T14:00:00 end=2022-04-07T14:55:00
--------------------------------------------------------------------------------
Cluster Utilization 2022-04-07T14:00:00 - 2022-04-07T14:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster Allocate     Down PLND Dow     Idle Reserved Reported
--------- -------- -------- -------- -------- -------- --------
    lido3   392441     2400        0   110108    48971   553920


If there's anything else needed for our meeting, please let me know.

Looking forward to 'meeting' you,
Dirk Ribbrock
Comment 20 Ben Roberts 2022-05-05 08:55:42 MDT
Thanks for joining the call today.  Here's a brief recap of what we talked about.

For the situation where you want to make sure jobs in the short and medium partitions are able to start within a few hours, we talked about creating a floating reservation for them.  If you made the reservation so that it was always 3 hours in advance, this would allow jobs with wall times of less than 3 hours to run in that space, but when a job that qualified for the reservation came along it would be guaranteed to start within a 3 hour window.  This is done with the TIME_FLOAT flag and you can read more about reservations like this here:
https://slurm.schedmd.com/reservations.html#float
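As a sketch (the reservation name, user, and node count below are placeholders to adapt, along the lines of the TIME_FLOAT example in the reservations documentation):

```shell
# A floating reservation whose start time always stays 3 hours in the
# future. Jobs with wall times under 3 hours can run inside the window;
# TIME_FLOAT makes the start slide forward with the current time.
scontrol create reservation reservationname=short_float \
    starttime=now+3hours duration=infinite flags=TIME_FLOAT \
    users=root nodecnt=4
```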

We also talked about your nodes occasionally being marked as down when there were NIC issues.  I think the timeout most likely to help with this is TCPTimeout, which defaults to 2 seconds.
https://slurm.schedmd.com/slurm.conf.html#OPT_TCPTimeout
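In slurm.conf the change would look something like this (the value of 10 seconds is purely illustrative, not a recommendation from the call):

```
# slurm.conf: allow slower TCP connection setup before giving up
TCPTimeout=10
```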

We discussed the fact that you have several users who tend to submit more jobs than the rest of your users.  This can lead to the jobs from the other users not getting started as soon as they should.  I think this can be addressed by modifying the bf_max_job_user and bf_max_job_test parameters (currently bf_max_job_user=1700 and bf_max_job_test=400).  Increasing bf_max_job_test will allow more jobs to be evaluated each backfill iteration, I would probably start by increasing it to around 600 and seeing how that affects your cluster.  I would recommend setting bf_max_job_user to something like 50 so that the users with a lot of jobs can still get a good number of jobs started each iteration, but it allows time for other users to have jobs started too.  If you find that the users with a lot of jobs aren't getting enough started then you can adjust this value accordingly until you find a happy medium.
https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_test=#
https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_user=#
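In slurm.conf this would be a change to the existing SchedulerParameters line, for example (using the starting values suggested above; any other parameters already on that line must be kept):

```
# slurm.conf: evaluate more jobs per backfill pass, but cap per user
SchedulerParameters=bf_max_job_test=600,bf_max_job_user=50
```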

You also have a situation where users are submitting a large number of jobs to the queue and having all their jobs accrue age based priority, resulting in their jobs having the highest priority for a long time.  You can control this by specifying the number of jobs that can accrue priority at once with the MaxJobsAccrue parameter in sacctmgr.
https://slurm.schedmd.com/sacctmgr.html#OPT_MaxJobsAccrue
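A hedged example of setting this on an association with sacctmgr (the user name and the limit of 8 are placeholders):

```shell
# Cap the number of this user's pending jobs that accrue age-based
# priority at once; jobs beyond the cap wait without gaining age factor.
sacctmgr modify user someuser set MaxJobsAccrue=8
```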

You have a situation where some of the nodes on your cluster are exclusive to different departments who have paid for the hardware.  You want to find a way to separate the reporting of these nodes from the general cluster as well as make the usage of the exclusive nodes not affect the fairshare value of the users when they want to use the commonly available resources.  I think the best way to handle this would be with a multiple cluster configuration.  One cluster can be for the general queue and the other for the dedicated nodes.  Users can request both clusters at submit time and it will go to the best one.  This would keep the accounting separate for the two types of nodes and allow users to take advantage of both.
https://slurm.schedmd.com/multi_cluster.html
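From the user's side, a multi-cluster submission might look like this (the cluster names are hypothetical):

```shell
# With --clusters, Slurm submits the job to whichever listed cluster
# can start it earliest, keeping accounting separate per cluster.
sbatch --clusters=general,dedicated job.sh
```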

You also have noticed situations where large jobs are getting starved.  It sounds like there may be other jobs that are submitted at times that have higher priority than these large jobs, which could cause the large jobs to lose their priority reservation.  You can increase the number of jobs that keep a priority reservation with bf_job_part_count_reserve, but I think the better solution is to make sure that these large jobs always have the highest priority.  You can do this by increasing the weight for the job size based priority and/or by increasing the priority of the partition that would have these large jobs in it.
https://slurm.schedmd.com/slurm.conf.html#OPT_bf_job_part_count_reserve=#
https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityWeightJobSize
https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityJobFactor
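A slurm.conf sketch of both knobs (the weight, the factor, and the partition name 'big' are illustrative, not tuned values):

```
# Give job size more influence on multifactor priority:
PriorityWeightJobSize=100000
# And/or raise the priority factor of the partition holding large jobs:
PartitionName=big Nodes=cstd01-[025-244] PriorityJobFactor=10 State=UP
```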

I know this is a lot to process, so let me know if you have any questions about these parameters or if you find that some of them aren't having the desired effect.

Thanks,
Ben