Improve DNS performance on large clusters #4036

pierresouchay · 2018-04-16T22:40:42Z

Now that #3948 has been merged, TCP DNS queries do not crash when too many services are present.

However, DNS performance is still very bad when many nodes are registered, since response time increases dramatically with the number of nodes.

Here is a comment that explains how to test it quickly: #3850 (comment)

Here are the results on my laptop, running consul agent -dev and polling with the following loop:

    while true; do
      http_count=$(curl -fs 'localhost:8500/v1/catalog/service/redis?pretty' | grep '"Node"' | wc -l)
      dns_count=$(dig @localhost -p 8600 SRV redis.service.consul +tcp +short | wc -l)
      dns_a=$(dig @localhost -p 8600 redis.service.consul +tcp +short | wc -l)
      echo "HTTP: $http_count ; DNS_SRV: $dns_count ; DNS_A: $dns_a"
      sleep 1
    done

Around 1300 records

SRV ~80ms
A ~7ms

    2018/04/17 00:21:06 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=418/1308, size:=65503/204941
    2018/04/17 00:21:06 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 80.247124ms) from client 127.0.0.1:52801 (tcp)
    2018/04/17 00:21:06 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 7.149242ms) from client 127.0.0.1:52805 (tcp)

After 5k records

SRV ~100ms
A ~25ms

    2018/04/17 00:36:00 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 99.704139ms) from client 127.0.0.1:64822 (tcp)
    2018/04/17 00:36:00 [DEBUG] dns: TCP answer to [{redis.service.consul. 1 1}] too large truncated recs:=1420/5080, size:=65510/234352
    2018/04/17 00:36:00 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 26.310653ms) from client 127.0.0.1:64824 (tcp)

A large part of this slowdown comes from the naive way records are truncated when the response is too big: entries are removed one at a time until the response fits. We therefore propose switching to a binary search to find the largest number of records that fits.
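To illustrate the idea, here is a minimal sketch in Go comparing the two strategies. It is not the code from the actual patch: `keepLinear`, `keepBinary`, the `sizeOf` callback and the per-record sizes are all hypothetical, and the sketch assumes that the encoded response size never shrinks when a record is added. The 65535-byte limit is the maximum DNS message size over TCP; the limit Consul actually applies may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// maxTCPSize is the largest DNS message that fits in a TCP response,
// because the length prefix is a 16-bit integer.
const maxTCPSize = 65535

// keepLinear drops one record at a time until the response fits.
// It calls sizeOf once per dropped record, plus one final check.
// sizeOf(n) is a hypothetical callback returning the encoded size of a
// response holding the first n records.
func keepLinear(total, limit int, sizeOf func(n int) int) int {
	n := total
	for n > 0 && sizeOf(n) > limit {
		n--
	}
	return n
}

// keepBinary returns the same count using binary search.
// It assumes sizeOf is non-decreasing in n, and calls it O(log total) times.
func keepBinary(total, limit int, sizeOf func(n int) int) int {
	// sort.Search finds the smallest i such that i+1 records no longer fit,
	// which means exactly i records can be kept.
	return sort.Search(total, func(i int) bool {
		return sizeOf(i+1) > limit
	})
}

func main() {
	// Made-up numbers for illustration only: a fixed header plus a
	// constant size per record.
	const total, header, perRecord = 5080, 12, 46

	calls := 0
	sizeOf := func(n int) int {
		calls++
		return header + n*perRecord
	}

	lin := keepLinear(total, maxTCPSize, sizeOf)
	linCalls := calls

	calls = 0
	bin := keepBinary(total, maxTCPSize, sizeOf)

	fmt.Printf("linear: keep %d records after %d size computations\n", lin, linCalls)
	fmt.Printf("binary: keep %d records after %d size computations\n", bin, calls)
}
```

Both functions return the same count; the difference is how many times the response size has to be computed, which matters if each computation has to walk the whole message.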


Will fix hashicorp#4036. Instead of removing entries one by one, find the optimal size using binary search. For SRV records with 5k nodes, the duration of DNS lookups is divided by 4 or more.

pierresouchay · 2018-04-16T22:58:30Z

After optimization in #4037

Time needed to get:

SRV records dropped from 100ms to 25ms (divided by 4)
A records from 25ms to 20ms (less impressive, but significant)

Example:

    2018/04/17 00:40:58 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=413/5080, size:=65457/804580
    2018/04/17 00:40:58 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 27.502257ms) from client 127.0.0.1:59778 (tcp)
    2018/04/17 00:40:58 [DEBUG] dns: TCP answer to [{redis.service.consul. 1 1}] too large truncated recs:=709/5080, size:=65474/234352
    2018/04/17 00:40:58 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 20.231774ms) from client 127.0.0.1:59780 (tcp)

pierresouchay mentioned this issue Apr 16, 2018

Perform a binary search to find optimal size of DNS responses #4037

Merged

mkeeler closed this as completed in #4037 Apr 23, 2018
