Ticket 15357

Summary: Timeouts from hostname resolution
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: slurmctld Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian.gilmer, tim
Version: 22.05.5   
Hardware: Linux   
OS: Linux   
Site: ORNL-OLCF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 23.02pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: getaddrinfo reproducer
perf.data
perf.data.tar.bz2
patch rebased to 22.05 (for test purpose)

Description Matt Ezell 2022-11-04 10:40:50 MDT
On Frontier, when we (re)start slurmctld or issue a reconfig, slurmctld will often hang for multiple minutes while the nscd process pegs one CPU at 100%. The `scontrol reconfig` RPC then times out. We are looking for advice on how to avoid this.

I extracted slurm's logic for resolving hosts into a standalone program and recreated the slow behavior.  If nscd has nothing cached:
[root@fm2.frontier ~]# nscd -i hosts
[root@fm2.frontier ~]# time ./gai

real    2m41.899s
user    0m0.135s
sys     0m0.589s

But then immediately afterward, the lookup becomes almost instant:
[root@fm2.frontier ~]# time ./gai

real    0m0.131s
user    0m0.008s
sys     0m0.121s

nscd doesn't preload the hosts file into its cache; when it encounters an entry it doesn't have cached, it calls out to nss, and each nss call (apparently) reads the whole hosts file for a single lookup. Our hosts file is currently 4.3 MiB (there may be some room to cull extraneous entries on the slurm controller node, but for now we use the same hosts file on all nodes in the cluster).

nscd also has a positive-time-to-live after which it dumps entries - unfortunately, it doesn't seem to distinguish between "local" addresses resolved via hosts and "external" addresses resolved by DNS, and I'm a little uncomfortable setting the TTL insanely high.
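For reference, the nscd.conf knobs in question look like this (the keyword names are real nscd.conf settings; the values here are illustrative, not a recommendation):

```
# /etc/nscd.conf - hosts cache settings (values illustrative)
enable-cache            hosts   yes
# seconds a successful lookup stays cached:
positive-time-to-live   hosts   3600
# seconds a failed lookup stays cached:
negative-time-to-live   hosts   20
```

Raising positive-time-to-live applies equally to hosts-file and DNS results, which is exactly the lack of distinction described above.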

Running without nscd:
[root@fm2.frontier ~]# systemctl stop nscd
[root@fm2.frontier ~]# time ./gai 

real    1m27.919s
user    1m20.608s
sys     0m7.312s

So nscd has overhead when nothing is cached compared to straight nss, but without nscd the lookup takes that long on every restart/reconfig.

The next thing I tried was to set up dnsmasq as a local caching daemon and configure resolv.conf/nsswitch.conf to prefer a DNS lookup over reading hosts:
[root@fm2.frontier ~]# time ./gai 

real    1m30.054s
user    0m0.504s
sys     0m2.310s

So slightly worse than reading the hosts file directly (this surprised me, but I guess the dns protocol overhead makes it just as slow as re-reading the file from the page cache).


So it seems my options are:
- Just deal with this being slow and timing out
- Come up with some tooling to periodically reload the nscd cache
- Use some different/better nscd replacement (not sure what's out there)
- Do <something> with slurm's cloud dns features to set addresses
- Turn this into a RFE to have slurm do something different/better

Any thoughts or advice would be very appreciated.
Comment 1 Marcin Stolarek 2022-11-07 01:33:36 MST
Matt,

Could you please share the source of "gai" and collect the profile of slurmctld while recreating the issue?

The commands should look like:
>perf record -s --call-graph dwarf -p `pidof slurmctld` sleep 300
>perf archive perf.data

then send both perf.data and perf.data.tar.bz2 (note that the .tar.bz2 is not a compressed version of perf.data; it is the separate debug-object archive that `perf archive` produces).

I'd like to find the specific call traces responsible for this slowness.

cheers,
Marcin
Comment 2 Matt Ezell 2022-11-07 06:16:31 MST
Created attachment 27617 [details]
getaddrinfo reproducer

This reproducer duplicates slurmctld's logic to resolve hostnames to IP addresses. It resolves all of Frontier's compute nodes.
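The attachment itself isn't reproduced here, but the pattern it exercises - one getaddrinfo() call per node name, with no application-level caching - can be sketched as follows (hypothetical code, not the attached reproducer; "localhost" stands in for the real Frontier node-name list, and the loop count is truncated for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Resolve one hostname the way slurm_set_addr() does: a fresh
 * getaddrinfo() call each time, relying entirely on nss/nscd. */
static int resolve_one(const char *host)
{
	struct addrinfo hints, *result = NULL;
	int rc;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	rc = getaddrinfo(host, NULL, &hints, &result);
	if (rc == 0)
		freeaddrinfo(result);
	return rc;
}

int main(void)
{
	/* The real reproducer iterates over every compute node name;
	 * here we just repeat a placeholder a few times. */
	int i, ok = 0;

	for (i = 1; i <= 3; i++)
		if (resolve_one("localhost") == 0)
			ok++;
	printf("resolved %d names\n", ok);
	return 0;
}
```

With ~10k node names and a multi-megabyte hosts file, each such call can trigger a full hosts-file scan, which is where the minutes go.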
Comment 5 Matt Ezell 2022-11-07 09:00:09 MST
Created attachment 27621 [details]
perf.data

Please gunzip it before attempting to use it.
Comment 6 Matt Ezell 2022-11-07 09:00:44 MST
Created attachment 27622 [details]
perf.data.tar.bz2
Comment 7 Matt Ezell 2022-11-07 09:02:19 MST
The top section from the perf report is:

-   43.77%     0.20%  :7164           slurmctld
   - 43.57% read_slurm_conf
      - 25.79% _validate_slurmd_addr (inlined)
         - 25.70% slurm_set_addr
            - 25.55% get_addr_info
               - 24.89% __GI_getaddrinfo (inlined)
                  - 14.75% gaih_inet.constprop.6
                     - 13.79% __nscd_getai
                        - 9.88% __nscd_open_socket
                           + 5.45% open_socket
                           + 3.65% wait_on_socket
                           + 0.77% __GI___read (inlined)
                        + 2.45% __GI___close_nocancel (inlined)
                        + 0.89% __readall
                  + 9.69% __check_pf
               + 0.59% snprintf (inlined)
      + 4.55% restore_node_features
      + 4.05% _sync_jobs_to_conf
      + 3.73% set_cluster_tres
      + 2.16% _purge_old_node_state
      + 1.08% _build_bitmaps_pre_select (inlined)
      + 0.65% select_g_node_init
        0.57% _set_features
Comment 9 Marcin Stolarek 2022-11-07 22:47:30 MST
Matt,

Not part of SchedMD's expertise, but did you consider using dnsmasq as a DNS cache server on the machine where slurmctld runs? It should be able to act as the primary nameserver in resolv.conf, answering from /etc/hosts and forwarding to secondary servers for other host resolution. As far as I remember, it reads the whole /etc/hosts file on startup.
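A minimal setup along those lines might look like the following (illustrative sketch; the upstream server address is a placeholder, and the option names are standard dnsmasq options):

```
# /etc/dnsmasq.conf
listen-address=127.0.0.1
cache-size=50000
# /etc/hosts is read by default; 'no-hosts' would disable it
# forward anything not found locally to the site DNS (placeholder address):
server=10.0.0.53
```

with /etc/resolv.conf on the controller pointing at `nameserver 127.0.0.1`.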

Let me know what you think or what the test shows - if you can experiment in the environment.

cheers,
Marcin
Comment 10 Matt Ezell 2022-11-08 06:45:41 MST
(In reply to Marcin Stolarek from comment #9)
> Matt,
> 
> Not a part of SchedMD expertise, but did you consider using dnsmasq as a DNS
> cache server on the machine where slurmctld runs? It should be able to work
> as a primary nameserver in resolve.conf using /etc/hosts and 2nd etc.
> servers for other hosts resolution. As far as I remember it reads the whole
> /etc/hosts file on startup.
> 
> Let me know what you think or what the test shows - if you can experiment in
> the environment.
> 
> cheers,
> Marcin

Yes, I tried this and included the results in the initial report. It was no faster than nss reading /etc/hosts on each getaddrinfo call without nscd.
Comment 11 Marcin Stolarek 2022-11-08 23:09:56 MST
>Yes, I tried this and included the results in the initial report. It was no faster than nss reading /etc/hosts without nscd each getaddrinfo call

Ah, sorry - I was looking into our code after reading your message, and that detail slipped my mind.

Thinking about potential improvements, what do you get when you run something like:

>#!/bin/bash 
>for j in $(seq 1 5)
>do
> ./gai $j 5 &
>done
>time wait

with the test code modified like:
>29c29
>< int main() {
>---
>> int main(int argc, char **argv) {
>33a34,35
>> 	int start = atoi(argv[1]);
>> 	int j = atoi(argv[2]);
>35c37
><         for (i=1;i<=10496;i++) {
>---
>>         for (i=start;i<=10496;i+=j) {

If you can please check this with all back-end configurations.

cheers,
Marcin
Comment 12 Matt Ezell 2022-11-09 08:48:22 MST
With "stock" nscd, 5 partitioned threads of gai showed nscd at 500%cpu and gives approximately 5x performance:

[root@fm2.frontier ~]# time ./gai 1 1

real    2m40.495s
user    0m0.229s
sys     0m0.670s
[root@fm2.frontier ~]# time ./gai 1 1

real    0m0.136s
user    0m0.024s
sys     0m0.112s
[root@fm2.frontier ~]# nscd -i hosts
[root@fm2.frontier ~]# time ./rungai.sh

real    0m31.707s
user    0m0.250s
sys     0m0.871s
[root@fm2.frontier ~]# time ./rungai.sh

real    0m0.060s
user    0m0.045s
sys     0m0.225s



Disabling nscd shows just under a 5x improvement:
[root@fm2.frontier ~]# systemctl stop nscd
[root@fm2.frontier ~]# time ./gai 1 1

real    1m25.946s
user    1m18.923s
sys     0m7.024s
[root@fm2.frontier ~]# time ./rungai.sh

real    0m19.469s
user    1m23.918s
sys     0m11.780s




dnsmasq is single-threaded, so we don't get the same 5x speedup, but we still see some improvement from parallel requests:
[root@fm2.frontier ~]# time ./gai 1 1

real    1m18.248s
user    0m0.454s
sys     0m1.974s
[root@fm2.frontier ~]# time ./rungai.sh

real    0m42.367s
user    0m0.811s
sys     0m2.349s




I'm not sure how hard it would be to implement parallel lookups in Slurm due to coarse-grained locking, but it does seem to help significantly in certain scenarios.
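A minimal sketch of what such parallel lookups could look like (hypothetical illustration, not Slurm's implementation), using the same strided partitioning as the modified gai reproducer - thread t resolves entries t, t+NTHREADS, t+2*NTHREADS, and so on:

```c
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define NTHREADS 4
#define NHOSTS   8

/* Placeholder host list; a real run would use the cluster's node names. */
static const char *hosts[NHOSTS] = {
	"localhost", "localhost", "localhost", "localhost",
	"localhost", "localhost", "localhost", "localhost",
};
static int resolved[NHOSTS];

/* Each worker takes a start offset and walks the list with a
 * stride of NTHREADS, so the threads partition the work evenly. */
static void *worker(void *arg)
{
	long i, start = (long)arg;

	for (i = start; i < NHOSTS; i += NTHREADS) {
		struct addrinfo hints, *res = NULL;

		memset(&hints, 0, sizeof(hints));
		hints.ai_family = AF_UNSPEC;
		if (getaddrinfo(hosts[i], NULL, &hints, &res) == 0) {
			resolved[i] = 1;
			freeaddrinfo(res);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long t;
	int i, total = 0;

	for (t = 0; t < NTHREADS; t++)
		pthread_create(&tid[t], NULL, worker, (void *)t);
	for (t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);

	for (i = 0; i < NHOSTS; i++)
		total += resolved[i];
	printf("resolved %d/%d hosts\n", total, NHOSTS);
	return 0;
}
```

getaddrinfo() is thread-safe, so the main constraint on doing this inside slurmctld is the coarse-grained locking mentioned above, not the resolver itself.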
Comment 16 Marcin Stolarek 2022-11-09 12:40:43 MST
Matt,

Looking back at dnsmasq, did you try to add hosts to its configuration? Maybe it will do the job?

cheers,
Marcin
Comment 19 Marcin Stolarek 2022-11-10 07:18:13 MST
Matt,

I'm sending a patch adding the parallelism to _validate_slurmd_addr to our QA team. Do you have a test system where you can give it a try?

cheers,
Marcin
Comment 20 Matt Ezell 2022-11-10 07:47:05 MST
(In reply to Marcin Stolarek from comment #19)
> Matt,
> 
> I'm sending a patch adding the parallelism to _validate_slurmd_addr to our
> QA team. Do you have a test system where you can give it a try?
> 
> cheers,
> Marcin

I have a place where I can test functionality, but it's not large enough that it sees the problem with slow lookups.
Comment 21 Matt Ezell 2022-11-10 07:48:35 MST
(In reply to Marcin Stolarek from comment #16)
> Matt,
> 
> Looking back at dnsmasq, did you try to add hosts to its configuration?
> Maybe it will do the job?
> 
> cheers,
> Marcin


The default config file shipped with sles has:
# If you don't want dnsmasq to read /etc/hosts, uncomment the
# following line.
#no-hosts

which implies to me that it is already reading in the hosts file.
Comment 31 Marcin Stolarek 2022-12-02 01:29:09 MST
Matt,

We've merged a commit that adds support for validating node addresses in parallel across multiple threads[1]. It adds SlurmctldParameters=validate_nodeaddr_threads=X, where X defaults to 1.

Are you able to apply that locally and check if it works for you? It is in master branch - Slurm 23.02 to be.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/6c6cdbe62315741d9f9d5858bb42959b54bf707e
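For reference, enabling the new option would be a one-line slurm.conf change along these lines (the thread count shown is illustrative; pick a value suited to the controller node's CPU count):

```
# slurm.conf on the controller
SlurmctldParameters=validate_nodeaddr_threads=16
```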
Comment 32 Matt Ezell 2022-12-05 13:18:18 MST
(In reply to Marcin Stolarek from comment #31)
> Matt,
> 
> We've merged a commit that adds support for parallel nodes address
> validation in multiple threads[1].  It adds
> SlurmctldParameters=validate_nodeaddr_threads=X, where the default for X is
> 1.
> 
> Are you able to apply that locally and check if it works for you? It is in
> master branch - Slurm 23.02 to be.

I'm out of the office for another week, but I should be able to test it if it's not too hard to backport to 22.05.  Thanks
Comment 33 Marcin Stolarek 2023-01-01 22:13:36 MST
Matt,

Happy New Year!

I was wondering if you had some time to take a look at the improvement introduced in commit 6c6cdbe6, did you?

cheers,
Marcin
Comment 34 Matt Ezell 2023-01-04 09:52:15 MST
(In reply to Marcin Stolarek from comment #33)
> Happy New Year!
> 
> I was wondering if you had some time to take a look at the improvement
> introduced in commit 6c6cdbe6, did you?

Happy new year! Unfortunately I have not gotten to it yet - I will try to get to it this week.

Thanks,
~Matt
Comment 35 Matt Ezell 2023-01-04 10:29:02 MST
(In reply to Matt Ezell from comment #34)
> (In reply to Marcin Stolarek from comment #33)
> > Happy New Year!
> > 
> > I was wondering if you had some time to take a look at the improvement
> > introduced in commit 6c6cdbe6, did you?
> 
> Happy new year! Unfortunately I have not gotten to it yet - I will try to
> get to it this week.
> 
> Thanks,
> ~Matt

It turns out 22.05 is missing list_t (43761a68cc0f56b4e64a3f9767134b63ca05e9a4) so this patch is not straightforward to backport. I don't think I can put master on Frontier at this time.
Comment 36 Marcin Stolarek 2023-01-05 04:48:21 MST
Created attachment 28340 [details]
patch rebased to 22.05 (for test purpose)

The attached patch contains the original changes + a commit adjusting it to 22.05 code base.
It's basically:
> -       list_t *nodes = arg;
> +       List nodes = arg;

> -       list_t *nodes = list_create(NULL);
> +       List nodes = list_create(NULL);

sorry for missing that before.

cheers,
Marcin
Comment 37 Matt Ezell 2023-01-06 11:13:15 MST
(In reply to Marcin Stolarek from comment #36)
> Created attachment 28340 [details]
> patch rebased to 22.05 (for test purpose)

Thanks. Verified on Frontier with 32 threads (to match our slurmctld node's cpu count).

[root@slurm1.frontier ~]# nscd -i hosts
[root@slurm1.frontier ~]# time scontrol reconfig

real    0m5.567s
user    0m0.006s
sys     0m0.000s

Previously that would exceed the TCPTimeout and the RPC would fail.
Comment 38 Matt Ezell 2023-01-06 11:33:49 MST
Thanks again for the work. Marking as fixed in 23.02pre1