Improve DNS performance on large clusters #4036

Closed
pierresouchay opened this issue Apr 16, 2018 · 1 comment · Fixed by #4037


@pierresouchay
Contributor

Now that #3948 has been merged, TCP DNS queries do not crash when too many services are present.

However, DNS performance is still poor when many nodes are registered: response time increases sharply with the number of nodes.

Here is a comment that explains how to test it quickly: #3850 (comment)

After registering a number of records, here are the results on my laptop with the agent running in consul agent -dev mode:

    # Every second, compare the number of entries returned over HTTP, DNS SRV and DNS A
    while true; do
      http_count=$(curl -fs 'localhost:8500/v1/catalog/service/redis?pretty' | grep -c '"Node"')
      dns_count=$(dig @localhost -p 8600 SRV redis.service.consul +tcp +short | wc -l)
      dns_a=$(dig @localhost -p 8600 redis.service.consul +tcp +short | wc -l)
      echo "HTTP: $http_count ; DNS_SRV: $dns_count ; DNS_A: $dns_a"
      sleep 1
    done

Around 1300 records

SRV ~80ms
A ~7ms

    2018/04/17 00:21:06 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=418/1308, size:=65503/204941
    2018/04/17 00:21:06 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 80.247124ms) from client 127.0.0.1:52801 (tcp)
    2018/04/17 00:21:06 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 7.149242ms) from client 127.0.0.1:52805 (tcp)

After 5k records

SRV ~100ms
A ~25ms

    2018/04/17 00:36:00 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 99.704139ms) from client 127.0.0.1:64822 (tcp)
    2018/04/17 00:36:00 [DEBUG] dns: TCP answer to [{redis.service.consul. 1 1}] too large truncated recs:=1420/5080, size:=65510/234352
    2018/04/17 00:36:00 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 26.310653ms) from client 127.0.0.1:64824 (tcp)

A large part of this slowdown comes from the naive method used to truncate the answer when it is too large: records are removed one at a time until the response fits. We therefore propose switching to a binary search to find the largest number of records that still fits.
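
A minimal sketch of the proposed idea, written in Python rather than Consul's Go and not taken from the actual patch. It assumes a hypothetical `encoded_size(records)` callback that returns the wire size of a response holding those records, and that this size never shrinks when a record is added. Under that assumption, the largest record count that fits under the limit can be found by bisection instead of by dropping records one at a time.

```python
from typing import Callable, Sequence

# A DNS message sent over TCP is preceded by a 16-bit length field,
# so it cannot exceed 65535 bytes.
MAX_TCP_DNS_SIZE = 65535


def truncate_to_fit(
    records: Sequence,
    encoded_size: Callable[[Sequence], int],  # hypothetical size function
    limit: int = MAX_TCP_DNS_SIZE,
) -> Sequence:
    """Return the longest prefix of `records` whose encoded size is <= limit.

    Assumes the empty response fits and that size grows with record count.
    """
    lo, hi = 0, len(records)  # invariant: the prefix of length `lo` fits
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always shrinks
        if encoded_size(records[:mid]) <= limit:
            lo = mid       # mid records fit, try keeping more
        else:
            hi = mid - 1   # too large, keep fewer
    return records[:lo]


if __name__ == "__main__":
    # Made-up size model for illustration: 12-byte header + 46 bytes per record.
    def fake_size(rs: Sequence) -> int:
        return 12 + 46 * len(rs)

    kept = truncate_to_fit(list(range(5080)), fake_size)
    print(len(kept), fake_size(kept))
```

The size model in the usage example is invented; real record sizes vary with name compression and record type, which is why the sketch takes the size function as a parameter.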

pierresouchay added a commit to pierresouchay/consul that referenced this issue Apr 16, 2018
Will fix hashicorp#4036

Instead of removing the entries one by one, find the optimal
size using binary search.

For SRV records, with 5k nodes, duration of DNS lookups is
divided by 4 or more.
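
To give a rough sense of why this helps, the sketch below counts how many "does it fit?" checks each strategy needs, using the figures from the A-record log line above (5080 records, 1420 kept). It is an illustration in Python, not Consul's code: `fits` is a hypothetical predicate, and the assumption that each step costs one full size computation is mine. The measured speedup also depends on everything else in the request path.

```python
from typing import Callable, Tuple


def linear_checks(total: int, fits: Callable[[int], bool]) -> Tuple[int, int]:
    """Drop one record at a time; return (records kept, number of checks)."""
    kept, checks = total, 0
    while kept > 0:
        checks += 1
        if fits(kept):
            break
        kept -= 1
    return kept, checks


def binary_checks(total: int, fits: Callable[[int], bool]) -> Tuple[int, int]:
    """Bisect on the record count; return (records kept, number of checks)."""
    lo, hi, checks = 0, total, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        checks += 1
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo, checks


if __name__ == "__main__":
    # Hypothetical predicate: up to 1420 records fit, as in the log line.
    def fits(n: int) -> bool:
        return n <= 1420

    print("linear:", linear_checks(5080, fits))
    print("binary:", binary_checks(5080, fits))
```

With these numbers the one-by-one approach performs 3661 checks, while bisection needs at most 13, since each check halves a range of 5081 candidate counts.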
@pierresouchay
Contributor Author

After optimization in #4037

Time needed to get:

  • SRV records dropped from 100ms to 25ms (divided by 4)
  • A records from 25ms to 20ms (less impressive, but significant)

Example:

    2018/04/17 00:40:58 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=413/5080, size:=65457/804580
    2018/04/17 00:40:58 [DEBUG] dns: request for name redis.service.consul. type SRV class IN (took 27.502257ms) from client 127.0.0.1:59778 (tcp)
    2018/04/17 00:40:58 [DEBUG] dns: TCP answer to [{redis.service.consul. 1 1}] too large truncated recs:=709/5080, size:=65474/234352
    2018/04/17 00:40:58 [DEBUG] dns: request for name redis.service.consul. type A class IN (took 20.231774ms) from client 127.0.0.1:59780 (tcp)
