| Summary: | Timeouts from hostname resolution | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | brian.gilmer, tim |
| Version: | 22.05.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | ORNL-OLCF | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | getaddrinfo reproducer, perf.data, perf.data.tar.bz2, patch rebased to 22.05 (for test purpose) |
Description
Matt Ezell 2022-11-04 10:40:50 MDT

Matt,

Could you please share the source of "gai" and collect a profile of slurmctld while recreating the issue? The commands should look like:

>perf record -s --call-graph dwarf -p `pidof slurmctld` sleep 300
>perf archive perf.data

then send both perf.data.tar.bz2 and perf.data (the .tar.bz2 is not a compressed version of perf.data).

I'd like to find the specific call traces responsible for this slowness.

cheers,
Marcin
Created attachment 27617 [details]
getaddrinfo reproducer
This reproducer duplicates slurmctld's logic to resolve hostnames to IP addresses. It resolves all of Frontier's compute nodes.
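The attachment itself isn't reproduced here, but its described logic (a getaddrinfo() loop over every Frontier compute node, consistent with the gai diff quoted further down this thread) can be sketched roughly as below. The `resolve_one` helper and the command-line interface are illustrative assumptions, not the attachment's actual code.

```c
/* Rough sketch of a getaddrinfo reproducer; the real attachment
 * generated all 10496 Frontier node names internally (per the
 * "for (i=1;i<=10496;i++)" loop in the diff quoted later), while this
 * sketch just resolves whatever hostnames are given on the command
 * line. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Resolve one hostname roughly the way slurmctld's get_addr_info()
 * does: AF_UNSPEC hints, free the result immediately. Returns the
 * getaddrinfo() return code (0 on success). */
int resolve_one(const char *host)
{
	struct addrinfo hints, *result = NULL;
	int rc;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	rc = getaddrinfo(host, NULL, &hints, &result);
	if (rc == 0)
		freeaddrinfo(result);
	return rc;
}

int main(int argc, char **argv)
{
	int i, failed = 0;

	for (i = 1; i < argc; i++)
		if (resolve_one(argv[i]) != 0)
			failed++;
	printf("failed lookups: %d\n", failed);
	return failed != 0;
}
```

Built with `cc -o gai gai.c`, timing a run over the full node list is what produces the numbers discussed below; each call goes through the configured NSS stack (files, nscd, DNS) every time unless a cache is in front of it.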
Created attachment 27621 [details]
perf.data
Please un-gzip it before attempting to use it.
Created attachment 27622 [details]
perf.data.tar.bz2
The top section from the perf report is:
- 43.77%  0.20%  :7164  slurmctld
   - 43.57% read_slurm_conf
      - 25.79% _validate_slurmd_addr (inlined)
         - 25.70% slurm_set_addr
            - 25.55% get_addr_info
               - 24.89% __GI_getaddrinfo (inlined)
                  - 14.75% gaih_inet.constprop.6
                     - 13.79% __nscd_getai
                        - 9.88% __nscd_open_socket
                           + 5.45% open_socket
                           + 3.65% wait_on_socket
                           + 0.77% __GI___read (inlined)
                        + 2.45% __GI___close_nocancel (inlined)
                        + 0.89% __readall
                  + 9.69% __check_pf
                  + 0.59% snprintf (inlined)
      + 4.55% restore_node_features
      + 4.05% _sync_jobs_to_conf
      + 3.73% set_cluster_tres
      + 2.16% _purge_old_node_state
      + 1.08% _build_bitmaps_pre_select (inlined)
      + 0.65% select_g_node_init
        0.57% _set_features
Matt,

Not a part of SchedMD expertise, but did you consider using dnsmasq as a DNS cache server on the machine where slurmctld runs? It should be able to work as the primary nameserver in resolv.conf, serving /etc/hosts itself and forwarding other lookups to secondary servers. As far as I remember, it reads the whole /etc/hosts file on startup.

Let me know what you think or what the test shows - if you can experiment in the environment.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #9)
> Matt,
>
> Not a part of SchedMD expertise, but did you consider using dnsmasq as a DNS
> cache server on the machine where slurmctld runs? It should be able to work
> as the primary nameserver in resolv.conf, serving /etc/hosts itself and
> forwarding other lookups to secondary servers. As far as I remember, it
> reads the whole /etc/hosts file on startup.
>
> Let me know what you think or what the test shows - if you can experiment in
> the environment.
>
> cheers,
> Marcin

Yes, I tried this and included the results in the initial report. It was no faster than nss reading /etc/hosts on each getaddrinfo call without nscd.

>Yes, I tried this and included the results in the initial report. It was no
>faster than nss reading /etc/hosts on each getaddrinfo call without nscd.

Ah, sorry - I was looking into our code after reading your message and it kind of disappeared from memory.

Thinking about potential improvements, what do you get when you run something like:

>#!/bin/bash
>for j in $(seq 1 5)
>do
> ./gai $j 5 &
>done
>time wait

with the test code modified like:

>29c29
>< int main() {
>---
>> int main(int argc, char **argv) {
>33a34,35
>> int start = atoi(argv[1]);
>> int j = atoi(argv[2]);
>35c37
>< for (i=1;i<=10496;i++) {
>---
>> for (i=start;i<=10496;i+=j) {

If you can, please check this with all back-end configurations.
cheers,
Marcin

With "stock" nscd, 5 partitioned instances of gai showed nscd at 500% CPU and gave approximately a 5x speedup:

[root@fm2.frontier ~]# time ./gai 1 1
real    2m40.495s
user    0m0.229s
sys     0m0.670s
[root@fm2.frontier ~]# time ./gai 1 1
real    0m0.136s
user    0m0.024s
sys     0m0.112s
[root@fm2.frontier ~]# nscd -i hosts
[root@fm2.frontier ~]# time ./rungai.sh
real    0m31.707s
user    0m0.250s
sys     0m0.871s
[root@fm2.frontier ~]# time ./rungai.sh
real    0m0.060s
user    0m0.045s
sys     0m0.225s

Disabling nscd shows just under a 5x improvement:

[root@fm2.frontier ~]# systemctl stop nscd
[root@fm2.frontier ~]# time ./gai 1 1
real    1m25.946s
user    1m18.923s
sys     0m7.024s
[root@fm2.frontier ~]# time ./rungai.sh
real    0m19.469s
user    1m23.918s
sys     0m11.780s

dnsmasq is single-threaded, so we don't get the same 5x speedup, but we still see some improvement from parallel requests:

[root@fm2.frontier ~]# time ./gai 1 1
real    1m18.248s
user    0m0.454s
sys     0m1.974s
[root@fm2.frontier ~]# time ./rungai.sh
real    0m42.367s
user    0m0.811s
sys     0m2.349s

I'm not sure how hard it would be to implement parallel lookups in Slurm due to coarse-grained locking, but it does seem to help significantly in certain scenarios.

Matt,

Looking back at dnsmasq, did you try to add hosts to its configuration? Maybe it will do the job?

cheers,
Marcin

Matt,

I'm sending a patch adding parallelism to _validate_slurmd_addr to our QA team. Do you have a test system where you can give it a try?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #19)
> Matt,
>
> I'm sending a patch adding parallelism to _validate_slurmd_addr to our
> QA team. Do you have a test system where you can give it a try?
>
> cheers,
> Marcin

I have a place where I can test functionality, but it's not large enough that it sees the problem with slow lookups.

(In reply to Marcin Stolarek from comment #16)
> Matt,
>
> Looking back at dnsmasq, did you try to add hosts to its configuration?
> Maybe it will do the job?
>
> cheers,
> Marcin

The default config file shipped with SLES has:

# If you don't want dnsmasq to read /etc/hosts, uncomment the
# following line.
#no-hosts

which implies to me that it is already reading in the hosts file.

Matt,

We've merged a commit that adds support for parallel node address validation in multiple threads[1]. It adds SlurmctldParameters=validate_nodeaddr_threads=X, where the default for X is 1.

Are you able to apply that locally and check if it works for you? It is in the master branch - Slurm 23.02 to be.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/6c6cdbe62315741d9f9d5858bb42959b54bf707e

(In reply to Marcin Stolarek from comment #31)
> Matt,
>
> We've merged a commit that adds support for parallel node address
> validation in multiple threads[1]. It adds
> SlurmctldParameters=validate_nodeaddr_threads=X, where the default for X is
> 1.
>
> Are you able to apply that locally and check if it works for you? It is in
> the master branch - Slurm 23.02 to be.

I'm out of the office for another week, but I should be able to test it if it's not too hard to backport to 22.05. Thanks

Matt,

Happy New Year! I was wondering if you had some time to take a look at the improvement introduced in commit 6c6cdbe6, did you?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #33)
> Happy New Year!
>
> I was wondering if you had some time to take a look at the improvement
> introduced in commit 6c6cdbe6, did you?

Happy new year! Unfortunately I have not gotten to it yet - I will try to get to it this week.

Thanks,
~Matt

(In reply to Matt Ezell from comment #34)
> (In reply to Marcin Stolarek from comment #33)
> > Happy New Year!
> >
> > I was wondering if you had some time to take a look at the improvement
> > introduced in commit 6c6cdbe6, did you?
>
> Happy new year! Unfortunately I have not gotten to it yet - I will try to
> get to it this week.
>
> Thanks,
> ~Matt

It turns out 22.05 is missing list_t (43761a68cc0f56b4e64a3f9767134b63ca05e9a4), so this patch is not straightforward to backport. I don't think I can put master on Frontier at this time.

Created attachment 28340 [details]
patch rebased to 22.05 (for test purpose)

The attached patch contains the original changes plus a commit adjusting it to the 22.05 code base. It's basically:

> - list_t *nodes = arg;
> + List nodes = arg;
> - list_t *nodes = list_create(NULL);
> + List nodes = list_create(NULL);

Sorry for missing that before.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #36)
> Created attachment 28340 [details]
> patch rebased to 22.05 (for test purpose)

Thanks. Verified on Frontier with 32 threads (to match our slurmctld node's CPU count).

[root@slurm1.frontier ~]# nscd -i hosts
[root@slurm1.frontier ~]# time scontrol reconfig
real    0m5.567s
user    0m0.006s
sys     0m0.000s

Previously that would exceed TCPTimeout and fail. Thanks again for the work.

Marking as fixed in 23.02pre1
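For readers wanting a feel for what the fix does: the idea behind validate_nodeaddr_threads (splitting the node list across worker threads that each call getaddrinfo(), which POSIX requires to be thread-safe) can be sketched standalone as below. This is an illustrative toy, not the actual Slurm implementation from commit 6c6cdbe6; names such as `resolve_parallel` and the fixed thread count are inventions of this sketch.

```c
/* Toy sketch of parallel address validation: split a hostname table
 * across NTHREADS workers, each resolving its strided share with
 * getaddrinfo(). Illustrates the concept only, not Slurm's code. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define NTHREADS 4

struct task {
	const char **hosts;	/* shared hostname table (read-only) */
	int nhosts;
	int start;		/* this worker's first index */
	int stride;		/* == number of workers */
	int resolved;		/* successes; written by this worker only */
};

static void *worker(void *arg)
{
	struct task *t = arg;
	struct addrinfo hints, *res;
	int i;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	/* Strided partition: each index is owned by exactly one worker,
	 * so no locking is needed on the counters. */
	for (i = t->start; i < t->nhosts; i += t->stride) {
		if (getaddrinfo(t->hosts[i], NULL, &hints, &res) == 0) {
			t->resolved++;
			freeaddrinfo(res);
		}
	}
	return NULL;
}

/* Resolve all hostnames in parallel; returns how many succeeded. */
int resolve_parallel(const char **hosts, int nhosts)
{
	pthread_t tid[NTHREADS];
	struct task tasks[NTHREADS];
	int i, resolved = 0;

	for (i = 0; i < NTHREADS; i++) {
		tasks[i] = (struct task){ hosts, nhosts, i, NTHREADS, 0 };
		pthread_create(&tid[i], NULL, worker, &tasks[i]);
	}
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		resolved += tasks[i].resolved;
	}
	return resolved;
}

int main(void)
{
	const char *hosts[] = { "localhost", "localhost", "localhost",
				"localhost", "localhost", "localhost" };

	printf("resolved %d of 6\n", resolve_parallel(hosts, 6));
	return 0;
}
```

Compile with `cc -pthread`. As the timings in this thread show, the achievable speedup is capped by the back end: nscd scaled to roughly the worker count, while single-threaded dnsmasq gave a smaller gain.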